Gabriel Krisman Bertazi
August 27, 2020
Linux 5.2 was released over one year ago and with it, a new feature was added to support optimized case-insensitive file name lookups in the Ext4 filesystem - the first of the native Linux filesystems to do so. Now, one year after this quite controversial feature was made available, Collabora and others keep building on top of it to make it more and more useful for system developers and end users. Therefore, this seems like as good a time as any to take a look at why this was merged, and how to put it to work.
More recently, f2fs has started to support this feature as well, following the Ext4 implementation and framework, thanks to an effort led by Google. Most, if not all, of the information described here also applies to f2fs, with small changes on the commands used to configure the superblock.
A file name is a text string used to uniquely identify a file (in this context, a directory is treated the same as a file) at a specific level of the directory hierarchy. While, from the operating system's point of view, it doesn't matter what the file name is, as long as it is unique, meaningful file names are essential for the end user, since they are the main key to locate and retrieve data. In other words, a meaningful file name is what people rely upon to find their valuable documents, pictures and spreadsheets.
Traditionally, Linux (and Unix) filesystems have always considered file names as an opaque byte sequence without any special meaning, requiring users to submit the exact match of the file name to find it in the filesystem. But that is not how humans operate. When people write titles, "important report.ods" and "IMPORTANT REPORT.ods" usually mean the same piece of data, and we don't care how the name was written when the file was created. We care about the content and the semantics of the words IMPORTANT and REPORT.
In English, the only situation where different spellings of a word mean the same thing is when dealing with uppercase and lowercase, but for other languages, that is not the case. Some languages have different scripts to represent the same information and it makes sense for a user not to care which writing system the file was originally titled in when retrieving the data later.
Most of these linguistic differences have been solved by userspace applications in the past, but bringing this knowledge into the kernel allows us to resolve important bottlenecks for applications being ported from other operating systems, like Windows games, which cannot simply be recompiled to understand that they are running on Linux and that the filesystem is now case-sensitive. In fact, making the kernel understand the process of language normalization and casefolding allows us to optimize our disk storage, such that the system can quickly retrieve the information requested. The end result is clear: a much more user-friendly Linux experience for end users and a much better platform to run beloved Windows games with Steam on Linux.
⚠ This is very important.
Before enabling it, make sure your kernel supports case-insensitive Ext4, and that the encoding version you plan to use is supported.
The kernel supports case-insensitive Ext4 if it was built with CONFIG_UNICODE=y. If you are not sure, you can verify it on a booted kernel by reading the sysfs file below. If it doesn't exist, case-insensitive support was not compiled into your kernel.
$ cat /sys/fs/ext4/features/casefold
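For scripted setups, the check above can be wrapped in a small helper. This is only a minimal sketch: the function name `casefold_supported` and its optional path argument are illustrative assumptions, and only the sysfs path itself comes from the text.

```shell
# Report whether the running kernel advertises case-insensitive Ext4.
# The optional first argument (hypothetical, handy for testing) overrides
# the sysfs path mentioned in the text.
casefold_supported() {
    feature_file="${1:-/sys/fs/ext4/features/casefold}"
    if [ -e "$feature_file" ]; then
        echo "supported"
    else
        echo "not supported"
        return 1
    fi
}
```

Called without arguments, the helper prints "supported" or "not supported" and returns a matching exit status, so it can guard a later mkfs call in a script.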
Currently, the kernel supports UTF-8 up to version 12.1. mkfs will always choose the latest version, but attempting to run a filesystem with a more recent UTF-8 version than the kernel supports is risky, and to preserve your data, the kernel will refuse to mount such a filesystem. To solve this issue, either update the kernel or configure mkfs to use an older version.
A patch is queued for the next release so that the kernel reports the latest supported Unicode revision in sysfs. Notice that the following file might not be available on your system, even if CONFIG_UNICODE is enabled.
$ cat /sys/fs/unicode/version
First of all, make sure you've read the section "Before enabling". Failing to follow those instructions may render your filesystem unmountable in your current kernel.
Enabling the feature takes two steps. The first is to enable the filesystem-wide casefold feature on the volume's superblock. This doesn't immediately make any directories case-insensitive, so don't worry, but it prepares the disk to support casefolded directories. It also configures what encoding will be used.
The second step is to configure a specific directory to be case-insensitive. But first, let's see how to create a disk that supports case-insensitive directories.
When creating a filesystem, you need to set the casefold feature in mkfs:
$ mkfs -t ext4 -O casefold /dev/vda
After that, you can verify that the filesystem correctly has the feature:
$ dumpe2fs -h /dev/vda | grep 'Filesystem features'
dumpe2fs 1.45.6 (20-Mar-2020)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg casefold sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
The feature is enabled in the filesystem in /dev/vda if the line above includes the feature 'casefold'.
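The same check can be automated by parsing the dumpe2fs output. The sketch below assumes the output format shown above; the function name `has_casefold_feature` is hypothetical.

```shell
# Read `dumpe2fs -h` output on stdin and succeed only if the
# "Filesystem features:" line lists casefold as a separate word.
has_casefold_feature() {
    awk '/^Filesystem features:/ {
             # Fields 1 and 2 are the label; features start at field 3.
             for (i = 3; i <= NF; i++) if ($i == "casefold") found = 1
         }
         END { exit !found }'
}
```

A typical call would pipe `dumpe2fs -h /dev/vda` into the function and branch on its exit status. Matching whole fields avoids false positives from any feature name that merely contains the string.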
Alternatively, you can mount the filesystem and check dmesg for the mount line:
$ mount /dev/vda /mnt
$ dmesg | tail
EXT4-fs (vda): Using encoding defined by superblock: utf8-12.1.0 with flags 0x0
EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: (null)
From the output above, vda was mounted with case-insensitive enabled and utf8-12.1.0.
Historically, any byte other than the slash ('/') and the null byte ('\0') is a valid part of a filename. This is because Unix filesystems see filenames as a sequence of slash-separated components that are just opaque byte sequences, without any meaning assigned to them. Higher-level userspace software gives them meaning by seeing them as characters for rendering. When talking about case-insensitive lookups, however, the kernel needs to inspect and understand what a character really is and what the rules are for case-folding. That is the reason we adopt an encoding in the kernel, like we did with UTF-8. But, for any encoding one may choose, the requirements for what counts as a valid name are much stricter. In fact, there are several sequences that are simply invalid text in UTF-8. When a program asks the kernel to create a file with one of those names, the kernel needs to decide whether to pretend the name is valid somehow or to return an error to the application.
The vast majority of applications don't care about case-insensitiveness, and expect a filename to just be accepted, as long as it is a valid Unix name. These applications will fail if the kernel throws an error on what they expect is a valid name, so by default, if an application tries to use an invalid name on a case-insensitive directory, the kernel will just let it happen, and treat that single file name as an opaque byte sequence. This is fine, but case-insensitive lookups will not work for that one file.
There are cases, on the other hand, where we want to be strict about what is accepted by the filesystem. Having bad filenames mixed with good ones is confusing, and leaves room for programs to misbehave. For those users, ext4 has a strict mode, which causes any attempt to create or rename a file with a bad name to fail and return an error to the application.
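To get a feel for which names would count as "bad", a userspace approximation is to ask iconv whether a name is well-formed UTF-8. This is only a sketch: the function name `is_valid_utf8` is hypothetical, and the kernel's own validation is tied to its Unicode tables, so the two checks may not agree on every name.

```shell
# Succeed if the given name is well-formed UTF-8.
# iconv exits with a non-zero status when it meets an invalid input
# sequence, so its exit status serves as the verdict.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

A name such as `hello_world` passes, while a name containing a lone 0xFF byte does not. In the default mode such a name is still accepted by the filesystem and treated as an opaque byte sequence, as described above.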
To build an Ext4 filesystem with strict mode enabled, use:
$ mkfs -t ext4 -O casefold -E encoding_flags=strict /dev/vda
If everything went fine and mkfs returned without any errors, the next time you mount this filesystem your kernel log will show something like the line below:
$ mount /dev/sda1 /mnt
$ dmesg | tail
EXT4-fs (sda1): Using encoding defined by superblock: utf8-12.1.0 with flags 0x0
It has two important pieces of information. The first is the encoding used which, in the example above, is UTF-8 supporting version 12.1.0 of the Unicode specification. The second piece of information is the flags argument, in this case 0x0, which modifies the behavior of the filesystem when dealing with casefolded directories.
At the time of this writing, the only flag supported is the Strict mode, in which case the flag mask would be 0x1.
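The flags value can be decoded from the mount message with a short helper. This is a sketch under the assumption that the log line keeps the format shown above; the function name `encoding_mode` and the labels it prints are illustrative.

```shell
# Read kernel log lines on stdin, take the first "Using encoding defined
# by superblock" message, and report whether the strict bit (0x1, as
# stated in the text) is set in its flags field.
encoding_mode() {
    flags=$(sed -n 's/.*Using encoding defined by superblock: .* with flags \(0x[0-9a-fA-F]*\).*/\1/p' | head -n 1)
    if [ -z "$flags" ]; then
        echo "unknown"
        return 1
    fi
    if [ $(( $flags & 1 )) -ne 0 ]; then
        echo "strict"
    else
        echo "default"
    fi
}
```

Feeding it the line from the example above prints "default", since the flags field there is 0x0.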
After mounting a case-insensitive enabled filesystem, it is now possible to flip the 'Casefold' inode attribute ('+F') in empty directories to make the lookup of files inside them case-insensitive:
$ mkdir CI_dir
$ chattr +F CI_dir
With that setting enabled, the following should succeed, instead of the last command returning "No such file or directory."
$ touch CI_dir/hello_world
$ ls CI_dir/HELLO_WORLD
The directory case-sensitiveness can be verified using lsattr. For instance, in the example below, the F letter indicates that the CI_dir directory is case-insensitive.
$ lsattr .
-------------------- ./CS_dir
----------------F--- ./CI_dir
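In a script, the F letter can be detected by looking only at the attribute column of the lsattr output. A minimal sketch, with the hypothetical function name `is_casefolded`:

```shell
# Read `lsattr -d DIR` output on stdin and succeed if the attribute
# column (the first field) contains the F (Casefold) letter.
# Only the first field is inspected, so an "F" in the path is ignored.
is_casefolded() {
    awk '{ if (index($1, "F") > 0) found = 1 } END { exit !found }'
}
```

For example, piping `lsattr -d CI_dir` into the function succeeds for a casefolded directory and fails for a regular one.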
To revert the setting and make CI_dir case-sensitive once again, the directory must be emptied, and then the Casefold attribute removed:
$ rm CI_dir/*
$ chattr -F CI_dir
$ lsattr .
-------------------- ./CS_dir
-------------------- ./CI_dir
-------------------- ./lost+found
It is a bit annoying to require the directory to be empty to flip the case-insensitive flag, but that is a technical requirement at the moment and unlikely to change in the future. In fact, to make the data of a case-insensitive directory accessible in a case-sensitive manner, it would be much easier to move it to a new directory:
$ mkdir CS_dir
$ mv CI_dir/* CS_dir/
$ rm -r CI_dir
This would have a similar effect, from a simple point of view.
The Casefold flag is inherited by directories created inside a casefolded directory. Therefore:
$ mkdir CI_dir
$ chattr +F CI_dir
$ mkdir CI_dir/foo
$ lsattr CI_dir
----------------F--- CI_dir/foo
It is possible to mix case-insensitive and case-sensitive directories in the same tree:
$ mkdir CI_dir
$ chattr +F CI_dir
$ mkdir CI_dir/foo
$ chattr -F CI_dir/foo
$ lsattr .
----------------F--- CI_dir
$ lsattr CI_dir
-------------------- CI_dir/foo
Remember, however, that in the examples above the order of commands matters, since a directory cannot have its Casefold attribute flipped if it is not empty.
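Because of the emptiness requirement, a script that flips the attribute may want to check the directory first rather than rely on chattr failing. The following is a sketch; both function names are hypothetical, and only `chattr +F` comes from the text.

```shell
# Succeed only if the directory exists and has no entries
# (hidden files count as entries; "." and ".." do not).
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Hypothetical wrapper: refuse early instead of letting chattr fail.
set_casefold() {
    if is_empty_dir "$1"; then
        chattr +F "$1"
    else
        echo "refusing: $1 is missing or not empty" >&2
        return 1
    fi
}
```

Note that the check and the chattr call are two separate steps, so another process could still create a file in between; the wrapper only gives a clearer error message in the common case.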
Currently, only UTF-8 encoding is supported, and I am not aware of plans to expand it to more encodings. While different encodings make a lot of sense for speakers of Eastern languages, for encoding compression reasons, I'm not aware of anyone currently working on it for Linux.
With that said, the Linux implementation performs the Canonical Decomposition normalization process before comparing strings. That means that canonically equivalent characters can be correctly searched using a different normalized name. For instance, in some languages like German, the upper-case version of the letter ß (Eszett), is SS (or U+1E9E ẞ LATIN CAPITAL LETTER SHARP S). Thus, it makes sense for a German speaker to look for a file named "floß" (raft, in English), using the string "FLOSS":
$ touch CI_dir/floß
$ ls CI_dir/FLOSS
There are also multiple ways to combine accented characters. Our method ensures, for instance, that multiple encodings of the word café (coffee, in Portuguese) are interchangeable in a casefolded lookup.
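To make the "multiple encodings" concrete, the sketch below builds two spellings of café from explicit UTF-8 bytes: one using the precomposed character U+00E9, the other using a plain "e" followed by the combining acute accent U+0301. The variable and function names are illustrative.

```shell
# Two byte sequences that both render as "café":
#   precomposed: U+00E9 (é) is the UTF-8 bytes C3 A9 (octal 303 251)
#   decomposed:  "e" + U+0301 (combining acute), bytes CC 81 (octal 314 201)
precomposed=$(printf 'caf\303\251')
decomposed=$(printf 'cafe\314\201')

# Count bytes (not characters) so the difference becomes visible.
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

echo "precomposed: $(byte_len "$precomposed") bytes"   # 5 bytes
echo "decomposed:  $(byte_len "$decomposed") bytes"    # 6 bytes
```

To a filesystem that treats names as opaque bytes these are two distinct names, even though they look identical on screen. With normalization applied before comparison, as described above, they refer to the same file.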
Let's see something cool. For this to work, you might want to copy-paste the command below, instead of typing it. Let's create some files:
$ touch CI_dir/café CI_dir/café CS_dir/café CS_dir/café
How many files were created? Can you explain it?
The case-insensitive feature as implemented in Ext4 is a non-intrusive mechanism to support this feature for those who need it, while minimizing impact to other applications. Given the per-directory nature, it is safe to enable the feature bit filesystem-wide and let applications enable it on directories as needed. It is simple to use and should yield higher performance for user space applications that previously had to emulate it in userspace.
Hopefully, we will soon see this feature being enabled by default for distro kernels.
Comments (46)
Esmil:
Aug 28, 2020 at 09:14 AM
How do you know the language the filename is written in?
As an example the files i.txt and I.TXT are usually "the same", but only if you know they're not written in Turkish.
Gabriel Krisman Bertazi:
Aug 30, 2020 at 06:50 PM
Esmil,
Thanks for the comment!
The implementation doesn't address locales, the kernel performs casefolding based on unicode's canonical normalization (NFD) with slight modifications. You can find details of the exact semantics in the in-tree documentation.
Markus:
Aug 31, 2020 at 07:06 AM
How does Unicode normalization relate to locale differences? IIUC, those are different concepts.
Gabriel Krisman Bertazi:
Aug 31, 2020 at 04:00 PM
Hi Markus,
Once again, the implementation doesn't address locales.
Esmil:
Aug 31, 2020 at 11:27 AM
I see, thanks.
TLDR: it uses Unicode's casefolding F (full) which just doesn't work correctly for Turkic languages.
Érico Nogueira:
Aug 28, 2020 at 12:44 PM
Hi, great article! I guess not much changed ever since your talk in LinuxDev BR!
There's a typo in the `mv CI_dir/*` command, you're missing the destination directory.
Gabriel Krisman Bertazi:
Aug 28, 2020 at 05:04 PM
Thanks Érico, fixed!
David Day:
Aug 28, 2020 at 04:33 PM
Ummm,why? So you can be like Windows? This is just dumb. It would make scripting and automation a nightmare, like spaces in filenames. This is a solution looking for a problem to solve.....
Cy:
Aug 29, 2020 at 11:17 AM
Windows NTFS is case sensitive though. It's just limit insensitive through user space.
How is it make things confusing? Isn't it the opposite? I get the part makes the script much 'dirty' but I won't say it's a nightmare to match naming tho, that's be case sensitive thing.
David Day:
Aug 29, 2020 at 06:39 PM
The fact that anyone depends on case insensitivity implies poor / lazy coding. Even for day to day use, this isn't really a problem. Software does not load file... Simply spell the file name right. Explicit is better than implicit, and all that. Let's hope this isn't enabled by default in distros. Again, a solution looking for an actual problem.
Matt Sharp:
Dec 13, 2020 at 11:39 PM
I think you have got this completely backwards. I'm going to assert that case *in*sensitivity is the feature in search of a problem. ".Xauthorrity" and ".xauthority" being separate files benefits who? Any system which relies on this is needlessly confusing to users or anyone who didn't write that system. It flies in the face of the principle that a system should do the expected thing.
This feature was developed because there is actually a problem with case sensitivity: using a case sensitive filesystem is a terrible user experience. When I'm tab completing, who does it benefit for me to have to remember whether someone decided their dotfolder/dotfile started with an upper case? When I'm cleaning out an entire dir except for one item and I run rm [a-r]* [t-z]* to just keep the "s" files, who does it benefit that all the files starting with upper case get kept (or maybe they don't depending on LC_COLLATE). This is all just pointless extra steps for the user at the altar of purity. Personally I would rather have upper case banned in file names rather than keep the status quo.
Thankfully the designers of DNS learned the lesson that unix designers didn't. I don't want to live on an internet where gmail.com, Gmail.com, and GMail.com are all separate domains. Or paypal.com, Paypal.com, PayPal.com are all separate. You get the idea.
(just to head it off as there as some people who read any criticism of *nix as though it must be written by a *nix hater or MS fanboy, I generally think *nix does most things better than windows, however I think case sensitivity in filenames is one thing *nix got wrong and windows got right)
Sriram:
Jan 30, 2021 at 04:21 PM
> Any system which relies on this is needlessly confusing to users or anyone who didn't write that system. It flies in the face of the principle that a system should do the expected thing.
Perhaps, the actual issue is with the English language itself. Why have 26 more characters which do exactly the same thing as the other 26? Statistically the CAPs are insignificant in frequency (only occurs for syntactic value eg. Start of a sentence or Paragraph or Acronyms) and adds zero value semantically.
Your example of the DNS being case insensitive is right. But the usage is different. I will give a counter example of wiki page names, where Case Sensitiveness gives more meaning semantically, also preserves the linguistic references. It also helps to identify a resource (wiki page) while the DNS works like just a case insensitive search feature for the web.
Moreover a file name seems to be very personal to the user of the system than any other software. That said, adding a case insensitive search feature to the kernel is all welcome, but why change the FileSystem to fit to one single edge usecase.
J Nimmo:
Aug 18, 2023 at 07:23 AM
".Xauthorrity" and ".xauthority" being separate files benefits who?"
Anyone competent, for a start. You don't want your browser to load up configuration files from ".coomfig/Maaacrosoft/eDgEing"
Craig Barkhouse [MSFT]:
Sep 11, 2020 at 05:13 AM
NTFS natively is case insensitive, i.e. names are stored case insensitively on disk and lookups are necessarily case insensitive. An individual create/open can, if a flag is set in the operation, be performed case sensitively. It's fairly easy if things are stored case insensitively on disk to perform a case sensitive lookup; you just position to the first case-insensitive match and then walk forward looking for the exact case. Recent versions of Windows also have a dubious feature whereby individual directories can be marked as case sensitive, basically the opposite of what is described in this blog post.
Aaron Clausen:
Aug 29, 2020 at 01:05 AM
So we get forty year old bad design choices from CP/M and DOS filename choices imported into the Linux kernel.
debuggerboy:
Aug 29, 2020 at 09:22 AM
Not against it, those who want to use it go ahead with it.
Let's make a mix of both. Those who enabled it, let them suffer the consequences.
While others can continue to use Linux the way it should be. That is Case Sensitive. I am not sure when we learn POSIX we are taught about case sensitive filesystems. The author's intended it to be case-sensitive.
I don't want it on any of my servers. I don't want to break my environment for some case insensitive psycho.
Gabriel Krisman Bertazi:
Aug 30, 2020 at 06:56 PM
debuggerboy,
This is the way this feature is meant to be used. It is not something to be enabled on the root of your system, and I actually made sure to make this not possible. This feature is meant to be used on a per-directory basis where it is needed, for applications that require this behavior and are aware of it. Userspace solutions are possible for a lot of problems solved in the kernel, but in this case, they are inherently racy and cannot perform like a kernel-side, disk-aware implementation.
The optional nature and per-directory approach will not break any existing deployment, while allowing applications that need this feature to co-exist in your system.
Rob:
Aug 29, 2020 at 09:43 AM
OS-level filename case-insensitivity is a handicap, not a feature. The wrong OS was changed.
David Day:
Aug 29, 2020 at 06:41 PM
How is it a handicap? By allowing you to have poor coding style, or having to use the correct file name?
Chuck Davis:
Aug 29, 2020 at 02:03 PM
There are a few reasons to prefer Linux over the competition. One of them has been case sensitive file names. This is a HUGE step backwards.
Me:
Aug 29, 2020 at 05:01 PM
Seems like this would be a major regression if it were ever enabled in a mainstream distribution. I just don't understand the point. This is currently handled wonderfully in user-space. I use case-insensitive searching and auto-completion on zsh and have never had an issue.
It sounds like it might be a more important issue in some non-Latin languages, and I would certainly support working to solve the issue for those users, but not everyone.
Gabriel Krisman Bertazi:
Aug 30, 2020 at 07:44 PM
Hi,
Maybe this point deserves some clarification.
By enabling it in current distros, the article means enabling the feature to be used in the filesystem (mkfs -O casefold), which prepares the volume to use encoding and allow the casefold inode attribute to be set by the user where desired. There is no intention/plans to make the root of the filesystem or any mounted volume have the casefold inode attribute set by default. That wouldn't make sense.
Nevertheless, it is important that distros adopt the feature flag at filesystem creation time since with the current e2fsprog it cannot be set after creation. I have patches to make tune2fs support setting the bit afterwards, but they are still under discussion.
Enabling the encoding feature system-wide doesn't have any immediate impact on the behavior of the volume nor visible impact on performance. The only immediate concern is that, since this is an _INCOMPAT feature, the volume won't be mountable by older kernels ( < 5.2 )
Krish:
Aug 30, 2020 at 03:52 AM
This is a silly reason to go and build this whole semantic in the kernel, also considering many languages don't have the concept of case. The use case of "finding your file semantically" could have easily been handled by case insensitive search of filenames.
Will be an IRRITANT if it is enabled for folks familiar with the POSIX case sensitivity.
Gabriel Krisman Bertazi:
Aug 30, 2020 at 07:15 PM
Krish, thanks for the comment.
Please note it is not a matter of the user finding the file semantically; that is a comment on how real languages and humans work. There are real use cases that need this feature, for instance compatibility with other environments, like the Android userspace that exposes this feature, Samba servers, and emulation of Windows applications over Linux.
The reason why this semantics needs to be done in the kernel is that it cannot be done efficiently in userspace, and such implementations are inherently slow and/or racy. For instance, it helps to have knowledge of how the directory entries are stored in the disk to perform lookups efficiently, and the filesystem itself has that information.
Krish:
Aug 30, 2020 at 03:57 AM
What is the need to hack the kernel to implement this retrograde validation? The stated use case "find my files by semantic" can be easily achieved by a case insensitive search of file names. This will be an IRRITANT to users coming from POSIX if activated without their knowledge, or even WITH some notice.
tim:
Sep 01, 2020 at 12:22 PM
Thanks for your work on this. Maybe it will help Steam and Wine, which I use occasionally. For sure it will help samba, which I have adminned in the distant past.
You have designed it so that users who don't need it can ignore it. The kernel is full of features that I don't use and therefore ignore, and I can't be alone in that, so I don't understand the negative comments here. It sounds like a fascinating project, and I wish you the joy of the bug fixing to come :)
Gabriel Krisman Bertazi:
Sep 01, 2020 at 04:13 PM
Tim,
I'm happy to hear this is going to be useful for you! Let us know how it goes and please, report any issues you found!
John Wiersba:
Sep 04, 2020 at 07:17 PM
What if a "rogue" filesystem is mounted (say, via a thumbdrive) which enables case-insensitivity but has purposeful encoding errors in the names stored on disk, such as non-canonical encodings. Are there any chances for an exploit this way?
Gabriel Krisman Bertazi:
Sep 04, 2020 at 07:31 PM
John,
That is a very open question. There is a whole class of exploits that attempt to instrument corrupted filesystems to exploit the kernel, abusing lack of checks in the filesystem code. In that sense, it would be a security issue, and a real instance of this would be a filesystem bug. That said, I don't think this feature makes such attack more or less likely. For non-canonical filenames, for instance, someone could possibly instrument the filesystem to occlude a dentry using another dentry, tricking the system to think it is writing to X but actually write to Y, but if an attacker has this kind of access to the disk, there are other ways they could obtain the same result that don't require an encoding-aware filesystem.
We take measures to prevent filename collisions in case-insensitive ext4 by, for instance, preventing a case-insensitive volume from being mounted case-sensitive, but a corrupted filesystem can always occur. fsck will detect that and fix it.
Security issues appearing on new features are always a possibility, but unless you have a specific attack model in mind, I think this is the most generic answer I can give :)
Somebody:
Sep 16, 2020 at 07:32 PM
So if I have a directory /home/me/testdir/
It has 2 files in it;
FILE1
file1
I also have a case-insensitive filesystem mounted at /mnt/
cp -f /home/me/testdir/* /mnt/
What do I end up with?
Only one of those 2 files, presumably with the first one's name (based on the order they are picked up by cp), and the second one's contents.
That's bad.
VERY bad.
And secondly, lets say that the original path has 10 thousand files in it and I want to find both of them....
find . -iname "file1"
Since I can search in a case insensitive manner in a case sensitive filesystem, what is the advantage?
Gabriel Krisman Bertazi:
Sep 16, 2020 at 07:40 PM
> So if I have a directory /home/me/testdir/ It has 2 files in it;
>
> FILE1 file1
>
> I also have a case-insensitive filesystem mounted at /mnt/
>
> cp -f /home/me/testdir/* /mnt/
>
> What do I end up with? Only one of those 2 files, presumably with the
> first one's name (based on the order they are picked up by cp), and the
> second one's contents.
>
> That's bad. VERY bad.
That's just regular semantics. If you rename a file over a file
with the same name, you overwrite it. In a case-insensitive
filesystem, FILE1 and file1 are the same name. The only problem is
your expectations of a Case-sensitive filesystem while you are
using a case-insensitive one. Same thing if you use vfat,
APFS, etc.
> And secondly, lets say that the original path has 10 thousand files in
> it and I want to find both of them.... find . -iname "file1"
>
> Since I can search in a case insensitive manner in a case sensitive
> filesystem, what is the advantage?
Lookup performance.
Somebody:
Sep 16, 2020 at 08:20 PM
vfat? What year is this? 1985?
So based on that, the expectation is that this WILL destroy data.
And while lookup performance may be *marginally* better, again what year is this? 1985?
For sh**s and giggles, 'time find . -iname "file1"' just pulled those 2 files out of a million in a complex directory heirarchy in under 1 second (which is way better than you'd get from a 286 you'd find in 1985), so I don't buy the lookup performance argument.
Gabriel Krisman Bertazi:
Sep 16, 2020 at 10:17 PM
> vfat? What year is this? 1985?
- APFS has optional case-insensitive support (released in 2017)
- HFS+ has optional case-insensitive support (1998, used up till APFS)
- NTFS has optional case insensitive support (Still used in every windows machine).
- exfat(2006) is case-insensitive (merged this year in Linux), default fs for large SD cards
So, basically any Windows, Mac, Iphone or Android device exposes some kind of case-insensitive filesystem support to userspace. it is 2020.
> So based on that, the expectation is that this WILL destroy data.
Yes. The same as "mv a b" will overwrite file b if it exists in any Unix filesystem you can find. If you care about it, you use "mv -n".
> And while lookup performance may be *marginally* better, again what year is this? 1985?
> For sh**s and giggles, 'time find . -iname "file1"' just pulled
> those 2 files out of a million in a complex directory heirarchy
> in under 1 second (which is way better than you'd get from a
> 286 you'd find in 1985), so I don't buy the lookup performance
> argument.
This test means absolutely nothing.
For starter, the parameter is not 1985. you don't get to say "under 1 second" is good for a single lookup. Then, a single arbitrary lookup of a small ASCII string in a large directory means absolutely nothing for performance analysis.
While you are there, "Complex directory hierarchy" doesn't matter at all. Create a single directory with a bunch of files on it.
Then your benchmark can be something like: what is the cost of performing N lookups of inexact case randomized filenames with average filename size S in a directory with M files, and compare a userspace solution to a kernel solution. You'll realize you need to start caching getdents in userspace to get even usable performance (not as good. just better). And now you deal with concurrent filename creations/deletions in your cache, which is a whole other issue. Finally, add utf-8 casefolding to the mix.
Then, also check the cost of verifying a file doesn't exist.
If you don't want to write the code to do all this in userspace to compare, i did it. Just google libcasefold.
Fresno Bob:
Dec 17, 2020 at 10:20 PM
I appreciate that you've put a lot of work into this.
That said, I concur with many others that it's an absolutely terrible idea.
Freso:
Jan 11, 2021 at 08:38 AM
> In English, the only situation where different spelling of a word mean the same thing is when dealing with uppercase and lowercase
So "colour" and "color" mean two different things? "Aluminium" and "aluminum"? "Acknowledgment" and "acknowledgement"?
I love that casefolding has made it into the kernel (unlike seemingly most other commenters), but that statement just seems outright wrong from a linguistic standpoint, unless I’m missing something… :)
Gabriel Krisman Bertazi:
Jan 11, 2021 at 05:11 PM
>I love that casefolding has made it into the kernel (unlike seemingly most other commenters), but that statement just seems outright wrong from a linguistic standpoint, unless I’m missing something… :)
Freso,
I think you are completely right. The point I made in the article at the time was an incorrect oversimplification, meant to explain that other languages have many types of complexities, arising from accented characters, marks, etc. But you are correct. At the end of the day, filesystems and applications, at different layers, will make arbitrary decisions about what means what. The case-insensitive problem is just one of these arbitrary decisions, but with a practical reasoning: compatibility with other APIs and operating systems.
Josh Frank:
Feb 23, 2021 at 07:15 PM
I'm really grateful this has been added! Unlike most of the haters here, it seems, even though no one is forced to use this. I need to support a piece of software that has a bug in it that can cause data corruption/loss in some cases due to casing mismatches in filenames. It isn't anything I can control or fix upstream. This will let me put the data in a case-insensitive folder and give me that extra security and peace of mind.
+1000! Some people live in the real world and must deal with real world problems. It's like the people who say there's no point to supporting filenames with whitespace because it's lazy / bad coding practice. Maybe that's true if you're only writing code on your own computer that never needs to do anything important. This is a big help for people who need to support less than ideal code in the real world.
Thank you!
David Santos:
Mar 09, 2021 at 02:30 AM
I think a lot of the arguments against it boil down to an aversion to change. Maybe some of these people have spent years defending case-sensitive filesystems as if they were inherently superior to case-insensitive ones, and now they'll have to explain why the Linux kernel added this feature. That might embarrass them, so they try to come up with scenarios to criticize the feature.
“What if I have files named FooBar and foobar in the same directory […]” do you, though? Why? And how is that more common or how is preserving the ability to have that more important than all the times something unintended happened because a file name was unwittingly specified with the wrong case? “I can remember the correct case of the files I use”, so what? Why is that good or important? Are there that many scenarios where the case of a file name matters?
I can think of several common use-cases where this will help or at the very least not hinder, but I can't think of a single scenario that I have met more than once where a case-insensitive filesystem would be a hindrance. I can't wait to update the kernel on my servers so I can start using it.
Mihai Moldovan:
Mar 12, 2021 at 03:16 PM
The comments are probably more enlightening than the article itself.
Instead of actually reading the article and (broadly) understanding the way this feature was implemented (first, fully optional and disabled by default in the global file system scope; second, even if enabled in the file system superblock, it won't have any consequences UNLESS it's explicitly enabled by users on specific directories), most comments just seem to skim the post and then ask why "bad design choices are now being implemented in the Linux kernel and users will just have to accept it".
Granted, case insensitivity IMHO is a bad design choice. It was implemented in some file systems like FAT*, APFS, HFS(+), NTFS, exFAT, and, frankly, all those file systems are still being used today, despite it not being 1985 any longer. Especially FAT* will, however, still continue to be used for quite a long time because it's easy to implement, widely supported, avoids additional writes to the device due to the lack of journal and there is almost no reason to change to a different file system. The most intriguing argument is the single file size limitation on FAT*, which might see a shift to exFAT, but I doubt that even in 5 years most USB flash drives, SD cards and the like will come pre-formatted with exFAT (unless, of course, the average USB flash drive size will exceed 2 TiB, which is quite unlikely). Even if that switch was made, it wouldn't help with case insensitivity. Case insensitive file systems are here to stay and all arguments brought up in the comments before are true and apply to them, but we cannot change that retrospectively. Even less so with compatibility in mind, which is a big deal.
The reason why I believe case insensitivity to be a bad design choice is because it adds a lot of complexity for the small benefit of easier access handling in user space. Yes, it's more comfortable not to have to care about "test1.txt" vs. "Test1.txt" vs. "tEsT1.tXt" (etc.), but the complexity involved in the kernel to achieve that probably outweighs this benefit.
Case sensitivity is way easier to implement. As the article explained, you just compare raw bytes. Do they match? Great, it's the file you've been looking for. Don't they? Bad luck, it must be a different file. The sweetness in this concept lies in the fact that semantics (the meaning of letters and words [which could be a problem for languages that don't use the concept of "words are made out of letters" but instead map words to specific pictograms, removing the difference between a letter and a word] in this context) need not be taken into account in the file system driver.
For implementing case insensitivity, you'll have to support at least parts of semantics. In the bad old days, for example when thinking about FAT*, case insensitivity was implemented naïvely, by using a mapping within a code page. Code pages included 256 characters, so the mappings could be easily held in memory and be defined once in a (more or less) unambiguous manner. For instance, Turkish users used a different code page than mid-European users and hence could get their correct semantics in case folding.
With Unicode seeing widespread usage, and multiple, incompatible encodings, normalization and case folding rules being established, things got complicated. One example of this effect is that Apple decided to be helpful and implement a custom version of the Unicode normalization form D and always decompose and recompose characters. This was incompatible with any other file system behavior and they later admitted that it was a mistake, which was fixed in APFS later on, in a way compatible with other file system implementations.
While case folding/mapping can be implemented using multiple lookup tables, getting it right is difficult, as other comments pointed out already. Turkish is a prime example of how the generic rules can lead to wrong results. The Unicode standard has provisions for this, called Locale-dependent Case Mappings (5.18), which would be required to get it really correct. However, passing locales to the kernel is non-trivial and at least the Linux kernel doesn't really support locales. Locale support is implemented in the user space, the kernel is mostly agnostic to it (and, so far, this hasn't been an actual problem).
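The effect of locale-independent folding can be seen in a few lines of Python. This is only a sketch: the helper name fold is hypothetical, and Python's own Unicode tables stand in for the kernel's, so individual code points may be treated differently there. The key is built from NFD normalization followed by full case folding, with no locale passed in anywhere.

```python
import unicodedata


def fold(name):
    """Locale-independent comparison key: NFD normalization, then
    full Unicode case folding (no locale information is consulted)."""
    return unicodedata.normalize("NFD", name).casefold()


# Without a locale, ASCII "I" always pairs with dotted "i" ...
same_english = fold("FILE.TXT") == fold("file.txt")

# ... so the Turkish dotless "ı" (U+0131) is not seen as lowercase "I".
same_turkish = fold("DOSYA ADI") == fold("dosya adı")

# Normalization makes precomposed and decomposed "é" compare equal,
# even though their raw bytes differ.
same_accent = fold("\u00e9") == fold("e\u0301")
```

Under these generic rules same_english and same_accent hold while same_turkish does not, which is the kind of result a Turkish user would consider wrong and that only locale-dependent mappings could fix.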
The current implementation is a trade-off between functionality and complexity. I'm not sure how other modern file systems like NTFS and APFS implement case mappings, though they may suffer from the same limitations.
Nevertheless, having case insensitivity support is a forward step, not a step backwards, when it comes to compatibility. Especially software like wine can benefit greatly from it. Often, you're not able to change the way applications access files, and while wine can emulate case insensitivity, there are still enough situations that make things break. One example I was hit by was an archive that was created on Windows and supposed to replace files in a directory. However, for some reason, the files differed in casing, so instead of overwriting some files, they were duplicated as (to the file system) different, unique files. This caused issues with the application, which turned out to load the "wrong", original, all-lowercase file. Having had a way to set case insensitivity for that directory would have immensely helped me. Back then, I resorted to renaming/overwriting the original files manually.
One aspect that makes me wary of enabling this feature and which hasn't been mentioned in the article explicitly (albeit in a later comment here), is that it doesn't seem to be fully transparent and compatible. If enabled, such a volume won't be mountable with Linux kernels prior to 5.2, which do not support this feature. This could be very frustrating. Sometimes, I do need to resort to older kernel versions (for instance rescue systems), which would make the data totally inaccessible, even if no directory on the file system uses the feature or if you don't intend to modify such directories. I do understand that, for instance, running an fsck that doesn't understand this feature on such a volume could lead to very bad breakage, but force-mounting it read-write with an older kernel to modify data that isn't marked as case insensitive might be necessary sometimes and, as far as I can think of, not all that dangerous, if you promise to pretty-please never meddle with case insensitive data on the volume.
I guess that older kernel versions will just refuse to mount the volume, though, without a way to force it? This would make rescue operations from older systems impossible.
numzero:
Apr 11, 2021 at 10:55 AM
The next step is converting LF to CRLF on the fly.
Seriously, this is a bigger can of worms than one can even imagine. Will it case-fold “a” and “а”? Will it case-fold “maßstab” and “MASZSTAB”? Etc. etc. And primarily, it can’t be updated. Once a case-insensitive filesystem is created, case folding can’t be changed without the risk of breaking it. So... create a case-insensitive filesystem, update the kernel, create another one–and have a nice day with two incompatible case-insensitive filesystems!
> When people write titles, "important report.ods" and "IMPORTANT REPORT.ods" usually mean the same piece of data, and you don't care how it was written when creating it.
Yet the kernel doesn’t work with titles. It works with file names, i.e. locally unique identifiers. The title could be stored in an xattr, for example (and will be able to contain “/” btw, and will not need to bear the “extension”).
>> Since I can search in a case insensitive manner in a case sensitive filesystem, what is the advantage?
>Lookup performance.
But most file managers load the whole directory contents before showing it anyway. And good ones do support locale-aware case-insensitive search/lookup. Many will also read the .directory file in each directory to show the proper icon (some were reading title from there as well IIRC).
> applications being ported from other operating systems, like windows Games
> a much better platform to run beloved Windows games with Steam on Linux.
So here is the reason. But... Windows™ games won’t run without Wine, and Wine could case-fold the filename before even sending it to the kernel. Case-fold in a Windows™-compatible locale-aware (IIRC) manner. That will not be case-preserving by itself but if really wanted, xattrs can be used to store original form. (and Z: should work as a case-sensitive filesystem, these are available in Windows™ too).
Amelia:
Nov 15, 2021 at 03:54 PM
CRLF is bullshit only Microsoft still uses, because Microsoft always makes bullshit. Any sane person will have adopted LF by now, because there's no reason to use two bytes when you can use one.
Davi Medrade:
Nov 15, 2021 at 04:04 PM
Tell that to people who indent with spaces.
Manuel:
Dec 06, 2021 at 01:25 AM
It is often said that in German ß equals SS, but that is simply not true. For some reason Unicode adopted this as a definition, but only because "ß" is generally considered to be a lowercase letter. But since there is no situation in which an uppercase "ß" would be needed, "SS" is just an incomplete replacement. There are even situations in which it can mean the exact opposite: for example, you could consume something "in Massen" or "in Maßen". The first one refers to a great amount, while the second one means a small one.
Btw, the reason why case insensitive file systems are considered a bad design choice is that an identifier should have exactly one canonical representation, not the amount of work needed to implement it otherwise. This even has security implications: if, for example, you configure a server to deny access to a certain path, that could be circumvented by just using a different representation. String comparisons just don't work anymore.
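Both points can be illustrated with a short Python sketch. The helper names fold and is_denied and the path /srv/private are hypothetical, and Python's str.casefold() is only an approximation of what a given filesystem does. The idea is that an access check on a case-insensitive directory has to compare the same folded key the filesystem uses, rather than the raw strings.

```python
import unicodedata


def fold(name):
    """Comparison key: NFD normalization plus full case folding."""
    return unicodedata.normalize("NFD", name).casefold()


def is_denied(path, denied_paths):
    """Compare folded forms so alternate spellings of a denied path
    cannot slip past a plain string comparison."""
    folded = {fold(p) for p in denied_paths}
    return fold(path) in folded


denied = ["/srv/private"]

# A raw string comparison misses the alternate spelling ...
raw_match = "/srv/PRIVATE" in denied
# ... while the folded comparison catches it.
folded_match = is_denied("/srv/PRIVATE", denied)

# Full case folding maps "ß" to "ss", so these two names collide.
collision = fold("in Maßen") == fold("in Massen")
```

A real check would also need to resolve symlinks and relative components, or compare file identity instead of names; this sketch only covers the folding step.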
Matt:
Dec 14, 2021 at 12:48 AM
You are talking like case insensitive file systems haven't existed in production use for a couple of decades. I'm sure there have been a few bugs, but we haven't been drowning in filesystem related canonicalization bugs. You have always needed to canonicalize paths for blocking access to a resource, if you relied on string comparison you were going to fail at security anyway.
Manuel:
Dec 14, 2021 at 04:40 AM
Yes we have, and also in an encoding nightmare before the invention of Unicode. Even today I see more than enough distorted texts, as the plain 8-bit encodings still haunt us. Mistakes of the past, like case-insensitive file systems. They should have been deprecated years ago. In an ideal world, every text would be ASCII and English would be the only language in existence. That's impossible to reach, but at least in the computer world, we should strive to not constantly repeat our mistakes.
Gwyneth Llewelyn:
Feb 02, 2024 at 08:50 AM
Having just gone through a case-sensitivity nightmare, I can certainly feel grateful that at least someone has given some thought to this :)
Anyway, for what it's worth, I just wanted to remind the haters that "ext4" is not the ne plus ultra of filesystems, the "last filesystem that ever needed to be written", the Magnum Opus That No Human Being May Touch. Having used Linux since the days when "ext" didn't even have a version number, I'm used to filesystems and assumptions on filesystems that change over time. There might be an "ext5" filesystem in development that is natively case-insensitive, and getting direct support for it at the kernel level makes all the sense to me.
I just love when people complain about a new feature that is OPTIONAL — meaning no-one is forced to use it — while not understanding that "more options is always better". Or, at least, some sort of feature that enables more options is better. We live on Planet Earth where English is not the only language and ASCII is not enough. One might argue how important it is to have Tengwar or Klingon supported on Unicode — but then we should also argue why so many code points have been devoted to emojis :-o (see, text emojis work great, we don't need images to convey expression). The point is: Unicode, and more specifically UTF-8 (and to a certain degree, UTF-16 as well), gave us a plethora of new options, allowing us to dump the multiple ISO encoding schemes, and, even better, the ancient and archaic "Windows Code Pages" concept. The world is a better place these days when we have "one encoding for all" — meaning that anyone will have the option to use their alphabet, not merely what ASCII allows or doesn't.
But Unicode degrades gracefully to ASCII, so, if all you wish to see is ASCII, you can. That option was not closed. We just gave people the option of having 70,000+ code points on a "universal" alphabet, as opposed to being limited to just 128 and being forced to write in English.
Alas — I digress. The point here is that, as time passes, I expect that software development gives us more options, not less; conversely, whatever "comes next" should have as little interference with the legacy systems as possible.
That said, I'm a great fan of the Principle of Least Surprise. Thus, those running Linux kernels and ext4 filesystems should safely assume that their filesystem is strongly case sensitive. Nothing wrong with that assumption, nor that it should be the default (I agree: it should!). But obviously I'm quite open to the possibility of having more options beyond the "default" — even if I might not really see a personal use-case for many of those options, but that's just me, of course.
In fact, in two weeks or so, I'm pretty sure that I will forget everything written here. But that's fine as well, since I can be reasonably confident that even the automatic kernel updates won't break my existing configurations … but in the future, I expect to get at least a warning in the logs explaining why case sensitivity is causing problems on my mounted filesystem (which presumably has been around for a long, long time), and how to fix/migrate things, not a message saying "this filesystem only works with 7-bit-ASCII English words".