Sane filenames, flexible page names

NTFS does not support filenames containing any of ''' \ / : * ? " < > | '''. Unix filesystems forbid only two characters in a filename: the slash and the NUL byte, which terminates a string and therefore the filename. We really have no use for the NUL byte, but why not allow a slash in a mobiki page name?

I was looking for a way to produce sanely encoded but reversible filenames for mobiki page names that work on all systems. My first idea was to replace all critical characters with identical-looking Unicode characters. Unfortunately, browsers may interpret this as a phishing attack.

Another idea would be to hex-encode the UTF-8 page name and store it that way. The drawback: at the filesystem level you could no longer recognize which page is in which file. I would like to keep that ability, if possible.
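To make the readability problem concrete, here is a minimal sketch of the hex idea. It is written in Python rather than mobiki's PHP, and the function names page_to_hex and hex_to_page are made up for illustration.

```python
def page_to_hex(name: str) -> str:
    """Encode a page name as the hex digits of its UTF-8 bytes."""
    return name.encode("utf-8").hex()


def hex_to_page(stem: str) -> str:
    """Reverse page_to_hex: hex digits back to the original page name."""
    return bytes.fromhex(stem).decode("utf-8")


if __name__ == "__main__":
    for page in ["a/b", "Über"]:
        stem = page_to_hex(page)
        # The stem is safe on every filesystem but unreadable for humans.
        print(page, "->", stem, "->", hex_to_page(stem))
```

The encoding is fully reversible, but even a plain name like "a/b" turns into "612f62", which nobody will recognize in a directory listing.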

Since all major filesystems are Unicode-compatible nowadays, my current idea is to replace the critical ASCII characters above with look-alike characters from other Unicode blocks, and to map them back when reading, just for filesystem storage, not for URLs. The drawback is that these Unicode characters can then no longer be used in page names themselves. My current proposal:

" 	= 201f		‟
?	= 203d		‽
|	= 2223		∣
:	= 2236		∶
<	= 2039		‹
>	= 203a		›
/	= 2044		⁄
\	= 2216		∖
*	= 2217		∗
*	= 2217		∗
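
The mapping could be applied roughly like this. This is only a sketch in Python, not mobiki's PHP code; the names REPLACEMENTS, page_to_filename and filename_to_page are hypothetical. The code points are those of the glyphs shown in the table (U+203D for ‽, U+203A for ›, U+2044 for the fraction slash).

```python
# Critical ASCII character -> look-alike Unicode replacement.
REPLACEMENTS = {
    '"': "\u201f",   # double high-reversed-9 quotation mark
    "?": "\u203d",   # interrobang
    "|": "\u2223",   # divides
    ":": "\u2236",   # ratio
    "<": "\u2039",   # single left-pointing angle quotation mark
    ">": "\u203a",   # single right-pointing angle quotation mark
    "/": "\u2044",   # fraction slash
    "\\": "\u2216",  # set minus
    "*": "\u2217",   # asterisk operator
}

TO_FS = str.maketrans(REPLACEMENTS)
FROM_FS = str.maketrans({v: k for k, v in REPLACEMENTS.items()})


def page_to_filename(name: str) -> str:
    """Swap critical ASCII characters for their look-alikes."""
    return name.translate(TO_FS)


def filename_to_page(stem: str) -> str:
    """Swap the look-alikes back to the ASCII originals."""
    return stem.translate(FROM_FS)
```

The drawback mentioned above shows up directly: a page name that already contains one of the replacement characters, say ‽, comes back from filename_to_page with a plain ? instead.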

Well, I did not take into account that although Windows supports Unicode, PHP running on Windows does not: it uses the fopen call and neither uses nor provides the _wfopen call needed to pass Unicode filenames down to the filesystem. So I have to drop that idea and think of another solution. :-(

What did I do?

I chose the following solution: URL-encode the filename. All ASCII letters and digits are kept 1:1, as are dash and underscore. Most other characters, such as Greek letters, brackets and German umlauts, are hex-encoded. So the filesystem is safe and nobody can POST a specially crafted filename to hack the site. As long as only "normal" characters are used, everything also looks fine. I think we can live with that solution. And I hope that the next released standard binaries of PHP will make use of the filesystem's Unicode capabilities by default.
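
As an illustration, here is the same idea sketched in Python with urllib.parse. This is not mobiki's actual code (mobiki is PHP, where a function such as rawurlencode would play this role), and the helper names are invented for the example.

```python
from urllib.parse import quote, unquote


def page_to_filename(name: str) -> str:
    """Percent-encode a page name so it is safe as a filename.

    safe="" makes sure that even the slash is encoded.
    """
    return quote(name, safe="")


def filename_to_page(stem: str) -> str:
    """Decode a percent-encoded filename back to the page name."""
    return unquote(stem)


if __name__ == "__main__":
    for page in ["Main_Page-1", "a/b", "Über"]:
        stem = page_to_filename(page)
        print(page, "->", stem, "->", filename_to_page(stem))
```

Plain names such as "Main_Page-1" pass through unchanged, while "a/b" is stored as "a%2Fb" and "Über" as "%C3%9Cber". Note that Python's quote also leaves the dot and the tilde unescaped, so the exact set of untouched characters depends on the encoding function used.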