What exactly makes storing it encoded a bad idea? A waste of space perhaps.
Because then you need to take care everywhere to decode it as needed and also make sure you never double-encode it.
For example, do other servers receive it pre-encoded? What if the remote instance doesn’t do that, how do you ensure what other instances send you is already encoded correctly? Do you just encode whatever you receive, at risk of double encoding it? And generally, what about use cases where you don’t need it, like mobile apps?
Data should be transformed where it needs to be, otherwise you always add the risk of messing it up, which is exactly what we’re seeing. The encoding is reversible, but it’s hard to know how many times it has been applied. For example, if I type &amp;, which is already an entity, do you detect that and decode it even though I never intended that, because I’m posting an HTML snippet?
Right now it’s so broken that if you edit a post, you get an editor… with escaped HTML entities. What happens if you save your post after that? It’s double encoded! Now everyone and every app has to make sure to decode HTML entities, and that leads to more bugs.
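To make the edit-and-save problem concrete, here is a minimal Python sketch. `save_post` is a hypothetical stand-in for a server that escapes on write; it is not Lemmy’s actual code, just the general pattern being criticized.

```python
import html


def save_post(body: str) -> str:
    """Hypothetical server-side step: escape HTML entities before storing."""
    return html.escape(body)


# The user is deliberately posting an HTML snippet that contains an entity.
original = "Use &amp; to write an ampersand in <b>HTML</b>"

stored = save_post(original)
# The editor is filled with the stored (escaped) text; saving it again,
# even without changes, escapes it a second time.
stored_after_edit = save_post(stored)

print(stored)
# Use &amp;amp; to write an ampersand in &lt;b&gt;HTML&lt;/b&gt;
print(stored_after_edit)
# Use &amp;amp;amp; to write an ampersand in &amp;lt;b&amp;gt;HTML&amp;lt;/b&amp;gt;

# Decoding once only undoes the last save, not both.
print(html.unescape(stored_after_edit) == original)  # False
```

Every client now has to guess how many times to decode, and a wrong guess either shows entity soup or mangles the snippet the user actually typed.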
There is exactly one place where it needs to be encoded, and that’s in web clients, more precisely at the moment it’s being displayed as HTML. That’s where it should be encoded. Mobile apps don’t care: they don’t even render HTML to begin with. Bots and most things using the API don’t care. They shouldn’t have to care just because it may be rendered as HTML somewhere. It just creates more bugs and more work for pretty much everyone involved. It sucks.
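As a rough sketch of what “encode only where it’s displayed as HTML” looks like, assume the API hands out the raw text untouched. The function names `render_html` and `render_plain` are made up for illustration, and markdown rendering is left out to keep it short.

```python
import html


def render_html(raw_body: str) -> str:
    """Web client: escape at the moment the text is placed into HTML."""
    return f"<p>{html.escape(raw_body)}</p>"


def render_plain(raw_body: str) -> str:
    """Non-HTML client (mobile app, bot): the raw text is already usable."""
    return raw_body


# Raw text exactly as the user typed it, stored and transported unchanged.
raw = "Tom & Jerry <script>alert(1)</script>"

print(render_html(raw))
# <p>Tom &amp; Jerry &lt;script&gt;alert(1)&lt;/script&gt;</p>
print(render_plain(raw))
# Tom & Jerry <script>alert(1)</script>
```

Only the client that produces HTML needs an HTML library, and because escaping happens exactly once, right before output, there is nothing to double-encode.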
Now we have an even worse problem, which is that we don’t know which post is encoded which way, so once 0.19 rolls out and there are version mismatches, it’s going to be a shitshow and may very well lead to another XSS incident.
That’s a problem of not conforming to any standard, not of it being a bad idea in general, the way, say, storing passwords in plaintext is.
It still leads to unsolvable problems, like what is expected when two instances federate content with each other? What if you use a web app to access a third-party instance and it spits out unsanitized data?
If you assume it’s part of the API contract, then an evil instance can send you unescaped content and you’ve got an exploit. If you escape it, you’ll double-escape content from well-behaved instances. This applies to apps too: what if Voyager, for example, starts expecting pre-sanitized data from the API, and it makes an API call to an evil instance that doesn’t sanitize? Bam, you’ve got yourself a potential XSS. There’s nothing they can do to prevent it. Either it’s inherently unsafe, or it’s safe but will double-escape.
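Here is a small Python sketch of that dilemma. The two payloads and the `client_trusts` / `client_escapes` functions are hypothetical; they stand for any client that receives the same field from a well-behaved instance and from a malicious one.

```python
import html

# Same API field, two different senders (both payloads are made up).
from_well_behaved = "Tom &amp; Jerry"          # escaped by the sending server
from_evil = "<img src=x onerror=alert(1)>"     # deliberately sent raw


def client_trusts(body: str) -> str:
    """Client assumes the API contract guarantees escaped data."""
    return f"<p>{body}</p>"


def client_escapes(body: str) -> str:
    """Client escapes everything it receives, just in case."""
    return f"<p>{html.escape(body)}</p>"


# Option 1: trust the contract. The evil payload lands in the page as markup.
print(client_trusts(from_evil))
# <p><img src=x onerror=alert(1)></p>

# Option 2: escape anyway. Safe, but the honest instance is double-escaped,
# so the user sees a literal "&amp;" instead of "&".
print(client_escapes(from_evil))
# <p>&lt;img src=x onerror=alert(1)&gt;</p>
print(client_escapes(from_well_behaved))
# <p>Tom &amp;amp; Jerry</p>
```

Neither option is both safe and correct, which is why the contract only works if everyone sends raw text and each client escapes for itself.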
You end up creating more vulnerabilities through edge cases than you solve by doing that. Now all an attacker needs to do is find a way to trick you into thinking the data is sanitized when it’s not.
The only safe transport for user data is raw. You can never assume any user or remote input is pre-sanitized. Apps, even web ones, shouldn’t assume the data is sanitized. They should sanitize it themselves, because only then can you guarantee that it will come out correctly and safely.
This would only work if you own both the server and the UI that serves it. It immediately falls apart when you don’t control the entire pipeline from submission to display, and on the fediverse with third party clients and apps and instances, you inherently can’t trust anything.
Sorry for the late reply, but the point is that there is no trivial way to detect whether, and how many times, something has been encoded. You may end up with multiple levels of encoding in multiple systems, and everything becomes intractable. Moreover, as I said, this doesn’t have to be a problem, as you can just decode everything as much as you can BEFORE you put it in the db, as the db can handle all of that by itself. Just let it do its job. Paradoxically, if you use only channels that support UTF-8 and don’t apply any transformation, your data is already perfect as it is. Then it is the job of the client to do whatever it needs to render properly, but a non-HTML client, for instance, shouldn’t need HTML libraries to strip HTML stuff from the text before it can be displayed.
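A quick Python sketch of why the encoding depth can’t be detected from the text alone. `escape_n` and `unescape_fully` are hypothetical helpers written for this example; `unescape_fully` is one possible reading of “decode as much as you can”.

```python
import html


def escape_n(text: str, times: int) -> str:
    """Apply HTML escaping a given number of times."""
    for _ in range(times):
        text = html.escape(text)
    return text


def unescape_fully(text: str) -> str:
    """Decode repeatedly until the text stops changing."""
    while True:
        decoded = html.unescape(text)
        if decoded == text:
            return text
        text = decoded


# Two different histories end up as the same string:
typed_ampersand_twice_escaped = escape_n("&", 2)
typed_entity_once_escaped = escape_n("&amp;", 1)

print(typed_ampersand_twice_escaped)  # &amp;amp;
print(typed_entity_once_escaped)      # &amp;amp;
print(typed_ampersand_twice_escaped == typed_entity_once_escaped)  # True

# Decoding to a fixed point recovers "&", which is right for the first
# author and wrong for the one who literally typed "&amp;".
print(unescape_fully(typed_entity_once_escaped))  # &
```

So a full decode cleans up accidental layers, but it can’t tell them apart from entities the user typed on purpose. Nothing in the string records the difference, which is the argument for never adding the layers during storage or transport in the first place.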