Java - RFC 3986 - which pchars need to be percent-encoded?
I need to generate an href to a URI. That is easy, except when it comes to reserved characters that require percent-encoding. For example, a link to /some/path;element should appear as <a href="/some/path%3Belement"> (I know that path;element represents a single entity).
At first I was looking for a Java library to do this, but I ended up writing something myself (see below for what failed with Java; this question is not Java-specific).
RFC 3986 does say when not to encode: as I read it, this is when the character falls in the unreserved class (ALPHA / DIGIT / "-" / "." / "_" / "~"). So far, so good. But what about the opposite case? The RFC only mentions that the percent sign (%) always needs encoding. What about the other characters?
Question: is it correct to assume that everything that is not unreserved can or should be percent-encoded? For example, an opening parenthesis ( does not necessarily need encoding, but a semicolon ; does. If I don't encode it, I end up looking for /first [*] when following <a href="/first;second">, whereas following <a href="/first(second"> I always end up looking for /first(second, as expected. What puzzles me is that both ( and ; are in the same sub-delims class in the RFC. I imagine that encoding everything that is not unreserved is a safe bet, but what about SEO and user friendliness when localized URIs are involved?
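For reference, the "encode everything that is not unreserved" approach described above can be sketched as follows. This is only an illustration of the idea in the question, not a recommendation; the class and method names (UnreservedOnlyEncoder, encodeSegment) are made up for this example, and the helper is meant for a single path segment, not a whole path.

```java
import java.nio.charset.StandardCharsets;

class UnreservedOnlyEncoder {

    // RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                || (b >= '0' && b <= '9')
                || b == '-' || b == '.' || b == '_' || b == '~';
    }

    // Hypothetical helper: percent-encodes every UTF-8 octet of one path
    // segment that is not unreserved, including sub-delims like ';' and '('.
    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte raw : segment.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (isUnreserved(b)) {
                out.append((char) b);
            } else {
                out.append('%')
                   .append(Character.toUpperCase(Character.forDigit(b >> 4, 16)))
                   .append(Character.toUpperCase(Character.forDigit(b & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("/some/" + encodeSegment("path;element"));
        System.out.println("/" + encodeSegment("first(second"));
    }
}
```

Note that this treats ( and ; identically, which is exactly the "safe bet" the question asks about; whether it is appropriate depends on how the receiving server interprets those characters.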
Now, what failed with the Java libraries. I tried new java.net.URI("http", "site", "/pa;th", null).toASCIIString(), but this gives http://site/pa;th, which is no good. Similar results were observed with:
> javax.ws.rs.core.UriBuilder
> Spring's UriUtils: I tried both encodePath(String, String) and encodePathSegment(String, String)
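The java.net.URI behavior mentioned above can be reproduced with a small program. This is a minimal sketch; the class name UriConstructorDemo and the helper build are made up for the example, and the host "site" is the placeholder used in the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriConstructorDemo {

    // Builds a URI with the multi-argument constructor, which quotes
    // illegal characters in the path but leaves legal ones such as ';' alone.
    static String build(String path) {
        try {
            return new URI("http", "site", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/pa;th")); // http://site/pa;th   (';' kept)
        System.out.println(build("/pa th")); // http://site/pa%20th (space quoted)
        System.out.println(build("/pa%th")); // http://site/pa%25th ('%' always quoted)
    }
}
```

So the constructor does percent-encode, but only characters that are not legal in a path; the semicolon is legal there and is passed through.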
[*] /first is the result of calling HttpServletRequest.getServletPath() on the server side when <a href="/first;second"> is clicked.
EDIT: I should probably mention that this behavior was observed under Tomcat. I have checked that Tomcat 6 and 7 behave the same way.
Solution
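RFC 3986 (section 2.2) states that if data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.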
This means that you decide which delimiter (i.e. <delimiter>) characters need to be encoded depending on the context. Those that do not need to be encoded should not be encoded.
For example, you should not percent-encode a / when it appears as a separator in the path component, but you would typically percent-encode it when it appears as data in a query or fragment (strictly speaking, RFC 3986 also permits a raw / there).
So, in fact, a ; character (which is a member of <reserved>) should not be automatically percent-encoded. Indeed, the Java URL and URI classes do not do so; see the URI(...) Javadoc, specifically step 7, for how the <path> component is handled.
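This is reinforced by the RFC's explanation of the purpose of reserved characters: URIs that differ only in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent, and the reserved characters are therefore protected from normalization.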
So this says that a URL containing a percent-encoded ; is not the same as a URL containing a raw ;. The RFC's wording also implies that reserved characters should not be percent-encoded or decoded automatically.
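java.net.URI happens to reflect this distinction, which makes it easy to see. The following is a small sketch; the class and helper names are made up for the example, and the URLs reuse the placeholder host from the question.

```java
import java.net.URI;

class ReservedEquivalenceDemo {

    // True if java.net.URI considers the two URI strings equal.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    // The path exactly as written, with percent-escapes left intact.
    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    // The path with percent-escapes decoded.
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        String raw = "http://site/pa;th";
        String encoded = "http://site/pa%3Bth";

        System.out.println(sameUri(raw, encoded));    // false: not equivalent
        System.out.println(rawPath(encoded));         // /pa%3Bth
        System.out.println(decodedPath(encoded));     // /pa;th
        // normalize() only removes "." and ".." segments; it does not
        // encode or decode reserved characters.
        System.out.println(URI.create(encoded).normalize()); // http://site/pa%3Bth
    }
}
```

The two forms only look the same after decoding, and at that point the information about which ; was a delimiter and which was data is gone.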
Which brings us to the question: why do you want the ; to be percent-encoded?
Sorry, but it does not follow that the semicolon should be escaped.
As far as the URL / URI specification is concerned, the ; has no special meaning. It may have a special meaning for a particular web server or website, but in general (i.e. without specific knowledge of the site) you have no way of knowing this.
> If the ; does have a special meaning in a particular URI, you will break that meaning if you percent-escape it. For example, if the website uses ; to allow a session token to be appended to the path, then percent-encoding it will prevent the site from recognizing the session token...
> If the ; is simply a data character provided by some client, you are potentially changing the meaning of the URI if you percent-encode it. Whether this matters depends on what the server does, i.e. whether or not it decodes the character as part of the application logic.
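To make the first case concrete, here is a rough sketch of a server that treats a raw ; as the start of path parameters and drops them before routing, which would explain the /first result reported in the question. This is an assumption made for illustration, not Tomcat's actual code; the class and method names are invented.

```java
class PathParameterSketch {

    // Hypothetical helper: removes ";params" from every segment of a raw
    // (still percent-encoded) path. A percent-encoded "%3B" is not touched.
    static String stripPathParameters(String rawPath) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < rawPath.length()) {
            char c = rawPath.charAt(i);
            if (c == ';') {
                // Skip everything up to the next segment separator.
                while (i < rawPath.length() && rawPath.charAt(i) != '/') {
                    i++;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripPathParameters("/first;second"));            // /first
        System.out.println(stripPathParameters("/first%3Bsecond"));          // /first%3Bsecond
        System.out.println(stripPathParameters("/app;jsessionid=abc/page")); // /app/page
    }
}
```

Under this assumption, encoding the ; changes what the server sees: the raw form loses everything after the semicolon, while the encoded form reaches the application intact.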
This means that knowing the "right thing" to do requires intimate knowledge of what the URI means to the end user and / or the site. That would require advanced mind-reading technology to implement. My suggestion is to have the CMS solve it by suitably escaping any delimiters in the URI paths before it passes them to your software. The algorithm is necessarily going to be specific to the CMS and the content delivery platform. It / they will be responding to requests for documents identified by the URLs and will need to know how to interpret them.
(Supporting arbitrary people using arbitrary paths is a little crazy; there have to be some limits. For example, not even Windows allows a file separator character in a file name component. So you have to draw some boundaries somewhere; it is just a matter of deciding where they should be.)