Java network programming – URI and URL

Preface

The previous article, "Internet query in Java", analyzed how to determine the address of a host on the Internet from an IP address or host name. Any given host may hold any number of resources, and these resources need identifiers so that hosts can access each other's resources. This article therefore takes a detailed look at URIs and URLs.

URI

URI stands for Uniform Resource Identifier. It is a string that uses a specific syntax to identify a resource. The resource identified by a URI may be a file on a server, an email address, a book, a host name, and so on. Simply remember: a URI is a string that identifies a resource. There is no need to worry about what the identified resource actually is, because users generally never see the resource entity itself; what is received from the server is only a byte representation of the resource (a binary sequence read from the network stream). A URI consists of a scheme and a scheme-specific part, written as follows:

scheme:scheme-specific-part

Note: the form of the scheme-specific part depends on the scheme used. Common URI schemes currently include, for example, file, ftp, http, https, mailto, magnet, telnet and urn.

In addition, Java makes heavy use of non-standard custom schemes, such as rmi, jar, jndi, doc and jdbc. These non-standard schemes serve a variety of purposes.

There is no fixed syntax for the scheme-specific part of a URI. However, it often takes a hierarchical form, such as:

//authority/path?query

The authority part of the URI names the authority responsible for resolving the rest of the URI. In many cases the URI uses an Internet host as the authority. For example, in http://www.baidu.com/s?ie=utf-8 the authority is www.baidu.com (from the URL point of view, the host name is www.baidu.com).

The path of the URI is the string the authority uses to determine which resource is identified. Different authorities may resolve the same path to different resources. This is easy to see: write two different projects whose home page path is /index.html in both, and the two are almost certainly not the same HTML document. The path can also be hierarchical. Its levels are separated by a slash "/", and the "." and ".." operators navigate within the hierarchy (these two operators are rarely seen in practice; it is enough to know they exist).

Syntax of URI

The scheme of a URI may consist of lowercase letters, digits, plus signs (+), dots (.) and hyphens (-). The other three parts of a typical URI (the scheme-specific part, that is, authority, path and query) should be composed of ASCII alphanumeric characters (the letters A-Z and a-z and the digits 0-9). The punctuation characters -, _, ., ! and ~ may also be used, while delimiters (such as /, ?, & and =) have predefined uses. All other characters, including non-ASCII letters, need to be escaped with a percent sign (%): the character is encoded in UTF-8, and each resulting byte is written as % followed by two hexadecimal digits. Note that a delimiter that is not being used as a delimiter also needs to be escaped. As a simple example, if the Chinese character 木 appears in a URI, its UTF-8 encoding is 0xE6 0x9C 0xA8, so it is escaped as %E6%9C%A8. URLEncoder in the JDK and the related classes in Apache Commons Codec provide URI (URL) encoding.
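The escaping rule can be sketched in a few lines. This is a deliberately naive illustration, not what URLEncoder does internally: the class and method names are made up for this example, and the method escapes every byte, including characters that would be legal unescaped.

```java
import java.nio.charset.StandardCharsets;

final class PercentEncodingSketch {

    // Encodes the text as UTF-8 and writes every byte as %XX (upper-case hex).
    // Hypothetical helper: it escapes ALL characters, even legal ones.
    static String percentEncodeAll(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            out.append('%').append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u6728" is the character 木; the escape avoids source-encoding issues.
        System.out.println(percentEncodeAll("\u6728")); // %E6%9C%A8
    }
}
```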

A URI can also carry the user's password. Because this creates a security vulnerability, it is uncommon and is not analyzed here.

URI class

URIs are represented in Java by the java.net.URI class.

Constructing URI instances

URI instances can be constructed in several ways:

public URI(String str) throws URISyntaxException

public URI(String scheme,String userInfo,String host,int port,String path,String query,String fragment) throws URISyntaxException

public URI(String scheme,String authority,String path,String query,String fragment) throws URISyntaxException

public URI(String scheme,String ssp,String fragment) throws URISyntaxException

//Static factory method; ultimately calls URI(String str)
public static URI create(String str)

Note that constructing a URI instance checks whether the input conforms to URI syntax; if it does not, a URISyntaxException is thrown. All of the constructors above build a URI instance either from the scheme plus scheme-specific part or from the individual components of the URI. The static factory method public static URI create(String str) mainly hides the checked exception URISyntaxException by converting it into the unchecked exception IllegalArgumentException, so there is no need to catch an exception explicitly.
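A minimal sketch of both construction styles follows. The class name, helper names and the sample values (localhost, doge, someone@example.com) are assumptions made for illustration only.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriConstructionSketch {

    // Builds a hierarchical URI from its individual components.
    static URI fromParts() {
        try {
            return new URI("http", null, "localhost", 8080, "/index.html", "name=doge", "top");
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if URI.create rejects the text with the unchecked IllegalArgumentException.
    static boolean rejectedByCreate(String text) {
        try {
            URI.create(text);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(fromParts());                                     // http://localhost:8080/index.html?name=doge#top
        System.out.println(new URI("mailto", "someone@example.com", null)); // mailto:someone@example.com
        System.out.println(rejectedByCreate("http://localhost/a b"));       // true: a raw space is illegal
    }
}
```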

Getting the properties of a URI

As mentioned earlier, a URI reference has up to three parts: scheme, scheme-specific part and fragment identifier. The general format is:

scheme:scheme-specific-part#fragment

Based on this syntax specification, the URI class provides the following methods to obtain these attributes:

public String getScheme()

public String getRawSchemeSpecificPart()

public String getSchemeSpecificPart()

public String getFragment()

public String getRawFragment()

PS: there is no getRawScheme() method because the URI specification requires scheme names to consist solely of legal ASCII characters, so percent escapes are not allowed in a scheme name. getRawSchemeSpecificPart() returns the scheme-specific part as it was given (still encoded), while getSchemeSpecificPart() returns the decoded scheme-specific part. Likewise, getRawFragment() returns the raw fragment identifier and getFragment() the decoded one. There are also further methods for obtaining the other properties of a URI:

//Whether the URI is absolute
public boolean isAbsolute()

//Whether the URI is opaque; if isOpaque() returns true, the URI is opaque
//and only the scheme, scheme-specific part and fragment identifier are available, not host, port, etc.
public boolean isOpaque()

public String getAuthority()

public String getRawAuthority()

public String getRawUserInfo()

public String getUserInfo()

public String getHost()

public int getPort()

public String getRawPath()

public String getPath()

public String getRawQuery()

public String getQuery()

PS: when isOpaque() returns true, the URI is opaque, and the authority, path, port, query and so on cannot be obtained from an opaque URI. In addition, several of the getters above have a corresponding getRawFoo() method. The getRawFoo() methods return the raw (still encoded) value of the property, whereas the methods without Raw in their name return the decoded string.
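The difference between raw and decoded getters, and between hierarchical and opaque URIs, can be seen in a short sketch. The describe helper and the sample URIs are hypothetical and exist only for this illustration.

```java
import java.net.URI;

final class UriPropertiesSketch {

    // Collects a few properties of the URI into one line of text.
    static String describe(URI uri) {
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", port=" + uri.getPort()
                + ", rawPath=" + uri.getRawPath()
                + ", path=" + uri.getPath()
                + ", opaque=" + uri.isOpaque();
    }

    public static void main(String[] args) {
        // Hierarchical URI: the raw path keeps %20, the decoded path contains a space.
        System.out.println(describe(URI.create("http://user@localhost:8080/a%20b?q=1#top")));
        // Opaque URI: host and path are null and the port is -1.
        System.out.println(describe(URI.create("mailto:someone@example.com")));
    }
}
```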

Resolving relative URIs

The URI class provides three methods for converting between absolute and relative URIs:

public URI resolve(URI uri)

public URI resolve(String str)

public URI relativize(URI uri)

The resolve methods turn a relative URI into an absolute one by resolving it against an absolute URI, for example:

public static void main(String[] args) throws Exception{
	URI absolute = URI.create("http://localhost:8080/index.html");
	URI relative = URI.create("/hello.html");
	URI resolved = absolute.resolve(relative);
	System.out.println(resolved);
}
//Console output:
//http://localhost:8080/hello.html

The relativize method does the reverse: given a base URI and another absolute URI, it computes the relative URI that leads from the base to the other, for example:

public static void main(String[] args) throws Exception{
	URI absolute = URI.create("http://localhost:8080/index.html");
	URI base = URI.create("http://localhost:8080/");
	URI relativized = base.relativize(absolute);
	System.out.println(relativized);
}

//Console output:
//index.html

Comparison of URIs

The URI class implements the Comparable interface, so URIs can be sorted. URI equality is not based on direct string comparison. Equal URIs must be of the same kind, both hierarchical or both opaque; only then are the remaining parts compared. When URIs are compared, the scheme and the host are case-insensitive, while the other parts are case-sensitive, except for the hexadecimal digits used in percent escapes. A character is considered different from its own escaped form, so the same URI before and after escaping is treated as two different URIs.

//1.
URI uri1 = URI.create("http://localhost:8000/index.html");
URI uri2 = URI.create("http://LOCALHOST:8000/index.html");
System.out.println(uri1.equals(uri2));
//Output: true

//2.
URI uri3 = URI.create("http://localhost:8000/index/A");
URI uri4 = URI.create("http://LOCALHOST:8000/index/%41");
System.out.println(uri3.equals(uri4));
//Output: false

String representation of URI

The URI class has two methods that return its string representation:

public String toString()

public String toASCIIString()

The toString() method returns the URI in unencoded form, that is, non-ASCII characters are not escaped with a percent sign. The return value is therefore not guaranteed to be a syntactically valid URI, although its parts follow the URI syntax. The toASCIIString() method returns the encoded form (US-ASCII only), in which such characters have been percent-escaped. toString() exists to make URIs readable, whereas toASCIIString() exists to make them usable.
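A small sketch of the two string forms, assuming a made-up URI on localhost whose path is the single character 木:

```java
import java.net.URI;

final class UriStringFormsSketch {

    // "\u6728" is the character 木; java.net.URI accepts it unescaped.
    static URI sample() {
        return URI.create("http://localhost/\u6728");
    }

    public static void main(String[] args) {
        URI uri = sample();
        System.out.println(uri.toString());      // readable form, 木 is kept as-is
        System.out.println(uri.toASCIIString()); // http://localhost/%E6%9C%A8
    }
}
```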

URL

URL stands for Uniform Resource Locator. A URL is in fact a special kind of URI: it not only identifies a resource but also provides a specific network location for it, through which a client can retrieve the resource.

The network location represented by a URL usually includes the protocol used to access the server (such as HTTP or FTP), the host name or IP address of the server, and the path of the resource on the server. A typical URL is http://localhost/myProject/index.html, which indicates that there is a document named index.html in the myProject directory of the local server and that it can be accessed over HTTP. (In practice the URL does not necessarily refer to a real physical path on the server, because what is deployed on the server is usually an application, such as a servlet application. The URL is likely to address an application interface, and the resource it finally maps to is decided by the application itself.)

Syntax of URL

The syntax of the URL is:

protocol://userInfo@host:port/path?query#fragment

Protocol: the protocol in a URL is another name for the scheme in a URI. In a URL the protocol can be file, ftp, http, https, magnet, telnet or another custom protocol string (but not urn).

UserInfo: the user information in a URL is the login information for the server and is optional. If present, it usually contains a user name and only rarely a password. Carrying user information in a URL is in fact unsafe.

Port: the port number in a URL is the port on which the application on the server is listening. This part is optional; if no port is specified, the default port of the protocol is used (80 for HTTP).

Path: the path in a URL denotes a specific directory on the server (it can also denote a specific file). This directory is not necessarily a physical directory; it may be a logical one. The reason is simple: directories on a server are generally not exposed directly for everyone to access. What runs on the server is a web application (in Java, usually a servlet application), and the data the path finally points to may even be the result of a query against MySQL on another server.

Query: the query is generally a string carrying additional parameters supplied to the server in the URL. It is generally used only in HTTP URLs. It contains form data entered by the user, in the form key1=value1&key2=value2&...&keyN=valueN.

Fragment: the fragment denotes a specific part of the remote resource. If the resource is an HTML document, the fragment identifier names an anchor in that document. If the resource is an XML document, the fragment identifier is an XPointer. (If you have written a blog post in Markdown, you will know that once a table of contents is added, the fragment is the target section a table-of-contents entry navigates to.)

Relative URL

A URL tells the browser a good deal about a document (here, any resource on the server that a URL refers to is called a document): the protocol used to retrieve it, the host it lives on, its path on that host, and so on. The document may itself contain URLs that share most of this information with the document's own URL. For that reason, URLs used inside a document do not have to be written out in full; they can inherit the protocol, host name and path of the parent document. Such incomplete URLs, which inherit part of the parent document's URL, are called relative URLs, whereas URLs that specify every part are called absolute URLs. In a relative URL, the missing parts are taken from the URL of the document in which it appears. For example, suppose we access an HTML document on the local server at http://localhost:8080/myProject/index.html, and index.html contains the hyperlink <a href="login.html">. When we click the link, the browser removes index.html from the original URL (http://localhost:8080/myProject/index.html), appends login.html, and finally requests http://localhost:8080/myProject/login.html.

If the relative URL starts with "/", it is relative to the document root rather than to the current document. For example, suppose we access an HTML document on the local server at http://localhost:8080/myProject/index.html, and index.html contains the hyperlink <a href="/login.html">. When we click the link, the browser jumps to http://localhost:8080/login.html.
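The same resolution rules can be reproduced with URI.resolve. This is only a sketch of what a browser does for a link; the resolve helper is a hypothetical wrapper, and the third reference ("../other/a.html") is an extra case added here for illustration.

```java
import java.net.URI;

final class RelativeUrlSketch {

    // Resolves a (possibly relative) reference against the URL of the parent document.
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080/myProject/index.html";
        System.out.println(resolve(base, "login.html"));      // http://localhost:8080/myProject/login.html
        System.out.println(resolve(base, "/login.html"));     // http://localhost:8080/login.html
        System.out.println(resolve(base, "../other/a.html")); // http://localhost:8080/other/a.html
    }
}
```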

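Relative URLs have two significant advantages over absolute URLs: they are shorter to write, and a tree of documents that links to itself with relative URLs can be moved to another host or directory without breaking its internal links.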

URL class

java.net.URL (referred to simply as URL below) is the JDK's unified abstraction of a URL. It is declared final, so it cannot be subclassed. The design of URL actually uses the Strategy pattern: the handlers for the different protocols are the strategies, and the URL class is the context that decides which strategy to select. The core properties of a URL include the protocol (or scheme), host name, port, query string and fragment identifier (named ref in the JDK). Each property can be set separately. Once a URL object has been constructed, none of its properties can be changed, so instances are thread safe.

Constructing URL instances

The main constructors of URL are as follows:

//Constructs a URL from its individual parts; file is the combination of path, query and fragment
public URL(String protocol,String host,int port,String file) throws MalformedURLException

//Same as above, but uses the default port of the protocol (80 for HTTP)
public URL(String protocol,String host,String file) throws MalformedURLException

//Constructs a URL from its string form
public URL(String spec) throws MalformedURLException

//Constructs a URL by resolving spec against a context (parent) URL
public URL(URL context,String spec) throws MalformedURLException

Here are some simple coding examples based on the above methods of constructing URL instances:

//1.
//Note that file must start with a slash (/)
URL url1 = new URL("http","127.0.0.1",8080,"/index");
//Prints http://127.0.0.1:8080/index
System.out.println(url1.toString());

//2.
URL url2 = new URL("http://127.0.0.1:8080/index");
//Prints http://127.0.0.1:8080/index
System.out.println(url2.toString());

//3.
URL url3 = new URL("http","127.0.0.1","/index");
//Prints http://127.0.0.1/index
System.out.println(url3.toString());

//4.
URL context = new URL("http","127.0.0.1","/index");
//Resolve a relative URL, keeping the protocol, host and port of the context
URL url4 = new URL(context,"/login");
//Prints http://127.0.0.1/login
System.out.println(url4);

The above covers only constructing URL objects directly. There are other ways to obtain URL instances, such as:

//name is the resource name (a String)
URL systemResource = ClassLoader.getSystemResource(name);

Enumeration<URL> systemResources = ClassLoader.getSystemResources(name);

URL resource = UrlMain.class.getResource(name);

URL loaderResource = UrlMain.class.getClassLoader().getResource(name);

Enumeration<URL> resources = UrlMain.class.getClassLoader().getResources(name);

ClassLoader.getSystemResource(String name) and ClassLoader.getSystemResources(String name) first check whether a system class loader exists. If it does, the system class loader is used to load the resource; otherwise the bootstrap class loader (bootstrap class path) is used. In short, these two methods load resources through the system class loader, searching from the root of the class path. When running from IDEA, for example, resources are loaded from the compiled /target/classes directory. You can verify this with ClassLoader.getSystemResource("").

The other three methods, Class#getResource(String name), Class#getClassLoader()#getResource(String name) and Class#getClassLoader()#getResources(String name), essentially load resources through the application class loader (AppClassLoader). Class#getResource(String name) resolves a plain name relative to the package of the current class (with IDEA this is generally the package directory under /target/classes). For example, given a class club.throwable.Main, if the directory club/throwable contains a picture doge.jpg, it can be loaded with club.throwable.Main.class.getResource("doge.jpg").

Note that if the resource is to be located from the root of the class path rather than from the package of the class, the name passed to Class#getResource(String name) must start with "/". This slash is part of the resource name syntax, not the file separator of the operating system. The two methods called directly on a ClassLoader instance always search from the root of the class path, so their names do not start with "/". This can be seen from the resolveName(name) method that Class#getResource(String name) calls.
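The naming rules can be tried out against a resource that every JDK ships, the class file of java.lang.String. The sketch below makes that assumption so that it runs without any project resources; the class and method names are hypothetical, and the exact URL printed depends on the JDK version.

```java
import java.net.URL;

final class ResourceLookupSketch {

    // Plain name: resolved relative to the package of String, i.e. java/lang/.
    static URL relativeToClass() {
        return String.class.getResource("String.class");
    }

    // Leading "/": resolved from the root instead of the package of the class.
    static URL absoluteFromClass() {
        return String.class.getResource("/java/lang/String.class");
    }

    // ClassLoader lookups always start at the root, so the name has no leading "/".
    static URL fromSystemLoader() {
        return ClassLoader.getSystemResource("java/lang/String.class");
    }

    public static void main(String[] args) {
        System.out.println(relativeToClass());
        System.out.println(absoluteFromClass());
        System.out.println(fromSystemLoader());
    }
}
```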

Getting data from a URL

URL instances provide several methods for obtaining data:

public final InputStream openStream() throws java.io.IOException

public URLConnection openConnection() throws java.io.IOException

public URLConnection openConnection(Proxy proxy) throws java.io.IOException

public final Object getContent() throws java.io.IOException

public final Object getContent(Class<?>[] classes) throws java.io.IOException

InputStream openStream()

The openStream() method connects to the resource referenced by the URL. After the necessary handshake between client and server has completed, it returns an InputStream from which the network data can be read. What this InputStream delivers is the raw content of the HTTP response body (if HTTP is used), so it may be plain text or a binary sequence (such as an image). For example:

	public static void main(String[] args) throws Exception{
		URL url = new URL("https://www.baidu.com");
		//try-with-resources closes the stream even if reading fails
		try (InputStream inputStream = url.openStream()) {
			int c;
			while ((c = inputStream.read()) != -1) {
				//write the raw byte; print(c) would print its numeric value instead
				System.out.write(c);
			}
			System.out.flush();
		}
	}

URLConnection openConnection()

openConnection() and openConnection(Proxy proxy) are similar, except that the latter can go through a proxy. openConnection() returns a URLConnection instance that represents a connection to the resource referenced by the URL; the actual network connection is established when connect() is called or when the input stream is requested. From this connection we can obtain an InputStream to read the network data. If the call fails, an IOException is thrown.
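The following sketch reads a resource through URLConnection. To stay runnable without network access it assumes a temporary local file addressed by a file: URL; the same readAll helper (a hypothetical name) would work for an http: URL.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class OpenConnectionSketch {

    // Opens a connection to the URL and reads the complete body as UTF-8 text.
    static String readAll(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        try (InputStream in = connection.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Writes a temporary file, reads it back through a file: URL, then deletes it.
    static String demo() {
        try {
            Path file = Files.createTempFile("url-demo", ".txt");
            try {
                Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
                return readAll(file.toUri().toURL());
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // hello
    }
}
```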

Object getContent()

getContent() and getContent(Class[] classes) are similar; the latter uses the class array to convert the content retrieved from the URL into an object of one of the given types. getContent() actually delegates to the getContent() method of the URLConnection instance. In general neither method is recommended, because the result of the conversion logic often differs from what we expect.

Get URL properties

Getting the properties of a URL means decomposing it into its components, each of which can be retrieved separately. The components were described earlier; to repeat, a URL is composed of the following parts: protocol, user information, host, port, path, query and fragment.

The corresponding methods provided in the URL class are:


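//Returns the scheme (protocol)
public String getProtocol()

//Returns the host name
public String getHost()

//Returns the authority, usually in the form host:port
public String getAuthority()

//Returns the port number
public int getPort()

//Returns the default port of the protocol (for example 80 for http), or -1 if the protocol defines no default port
public int getDefaultPort()

//Returns everything from the first slash (/) after the host name up to, but not including, the # of the fragment identifier
public String getFile()

//Like getFile(), but without the query string
public String getPath()

//Returns the fragment identifier of the URL
public String getRef()

//Returns the query string of the URL
public String getQuery()

//Returns the user information of the URL; rarely used
public String getUserInfo()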

Note the difference between getFile() and getPath(), for example:

URL url = new URL("https://localhost:8080/search?name=doge#anchor-1");
System.out.println(String.format("path=%s",url.getPath()));
System.out.println(String.format("file=%s",url.getFile()));
//Console output:
//path=/search
//file=/search?name=doge

Comparison

URL instances are usually compared with the equals() and hashCode() methods. Two URLs are considered equal if and only if they point to a resource on the same host, port and path and have the same fragment identifier and query string. When equals() is called, it tries to resolve the host through DNS. This may be a blocking I/O operation with a significant performance cost, so consider caching the results or converting the URLs to URIs before comparing them.
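Converting to URI before comparing might look like the sketch below. The helper names and the example.com URLs are assumptions for illustration. The comparison is purely textual (with a case-insensitive host), so unlike URL.equals() it does not treat two host names that resolve to the same IP address as equal.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlComparisonSketch {

    // Builds a URL from text without declaring a checked exception.
    static URL url(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Compares two URLs through their URI form, so no DNS lookup takes place.
    static boolean sameAsUri(URL a, URL b) {
        try {
            return a.toURI().equals(b.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameAsUri(url("http://EXAMPLE.com/a"), url("http://example.com/a"))); // true
        System.out.println(sameAsUri(url("http://example.com/a"), url("http://example.com/b"))); // false
    }
}
```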

Conversion

The URL class has three commonly used instance methods that convert a URL into another form:

public String toString()

public String toExternalForm()

public URI toURI()

In fact, toString() simply calls toExternalForm(), and toExternalForm() uses a StringBuilder to concatenate the components of the URL. The returned string can be used directly in a browser. The toURI() method converts a URL instance into a URI instance.

Encoding and decoding of URLs

When working on web projects or using Postman, we often see x-www-form-urlencoded, a common media type (content type). With this type, the characters in the URL have to be encoded. Why? It is a historical legacy: Unicode was not yet widespread when the web and HTTP(S) were invented, so the characters used in a URL must come from a fixed subset of ASCII, namely the uppercase letters A-Z, the lowercase letters a-z, the digits 0-9 and the punctuation characters - _ . ! ~ * ' ( and ).

Other characters, such as : / & ? @ # ; $ + =, may also be used, but only for their specific purposes. If they appear as part of the content of a URL path or query string, they must be encoded along with the rest of that content.

One point to note: URL encoding applies only to the URL path and the parts after it, because the URL specification requires the parts before the path to consist of characters from the fixed ASCII subset.

URL encoding is very simple: apart from ASCII digits, letters and the permitted punctuation marks, every character is converted to its byte representation, and each byte is written as a percent sign (%) followed by two hexadecimal digits. The space is a special case because it is so common: it can be encoded as %20 or as a plus sign (+), while the plus sign itself is encoded as %2B. The characters /, #, =, & and ? must be encoded when they are not used as delimiters.

Decoding is the reverse of the encoding process described above and is not covered in detail here.
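The space and plus-sign rules, and the round trip through decoding, can be checked with the JDK classes directly. The wrapper methods below are hypothetical helpers that only hide the checked exception; the sample strings are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FormEncodingSketch {

    // Encodes a single value (not a whole URL) using UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Decodes a single value that was encoded with UTF-8.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("a b+c&d"));     // a+b%2Bc%26d
        System.out.println(decode("a+b%2Bc%26d")); // a b+c&d
        System.out.println(decode("a%20b"));       // a b
    }
}
```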

Java provides two classes, java.net.URLEncoder and java.net.URLDecoder, for URL encoding and decoding respectively. Use the two-argument static methods encode(String value, String charset) and decode(String value, String charset); the single-argument versions are deprecated and not recommended. Note that java.net.URLEncoder and java.net.URLDecoder do not decide which parts of a URL need encoding or decoding and which do not. Passing an entire URL string to the encoder inevitably gives an unexpected result. For example:

String url = "http://localhost:9090/index?name=张大doge";
String encode = URLEncoder.encode(url,"UTF-8");
System.out.println(encode);
//Output: http%3A%2F%2Flocalhost%3A9090%2Findex%3Fname%3D%E5%BC%A0%E5%A4%A7doge

In fact, only the path and the parts after it need to be encoded and decoded. For the URL http://localhost:9090/index?name=张大doge, the only parts that need encoding are index, name and 张大doge; everything else should stay as it is. A correct example follows:

public static void main(String[] args) throws Exception {
	String raw = "http://localhost:9090/index?name=张大doge";
	//everything from "//" onwards: //localhost:9090/index?name=张大doge
	String base = raw.substring(raw.lastIndexOf("//"));
	//the part after the last slash: index?name=张大doge
	String pathLeft = base.substring(base.lastIndexOf("/") + 1);
	String[] array = pathLeft.split("\\?");
	String path = array[0];
	String query = array[1];
	//the untouched prefix: http://localhost:9090/
	base = raw.substring(0,raw.lastIndexOf(path));
	path = URLEncoder.encode(path,"UTF-8");
	String[] queryResult = query.split("=");
	String queryKey = URLEncoder.encode(queryResult[0],"UTF-8");
	String queryValue = URLEncoder.encode(queryResult[1],"UTF-8");
	System.out.println(base + path + "?" + queryKey + "=" + queryValue);
}
//Output: http://localhost:9090/index?name=%E5%BC%A0%E5%A4%A7doge
//In UTF-8, 张 is encoded as E5 BC A0 and 大 as E5 A4 A7
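As an alternative to splitting the string by hand, the multi-argument java.net.URI constructor can do the component-wise escaping. This is a sketch of a different technique, not the approach used in the example above; the build helper is a hypothetical name. One caveat: the constructor does not escape delimiters such as & and = inside the query, so values that contain them still have to be encoded individually.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriEncodingSketch {

    // Builds an http URI from separate components and returns its ASCII-only form.
    // Pass null for query when there is none.
    static String build(String host, int port, String path, String query) {
        try {
            return new URI("http", null, host, port, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "\u5F20\u5927" is 张大; the escapes avoid source-encoding issues.
        System.out.println(build("localhost", 9090, "/index", "name=\u5F20\u5927doge"));
        // prints http://localhost:9090/index?name=%E5%BC%A0%E5%A4%A7doge
        System.out.println(build("localhost", 9090, "/my index", null));
        // prints http://localhost:9090/my%20index
    }
}
```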

Proxy

Many systems access resources in web applications or on other servers through a proxy server. The proxy server receives the request from the local client and forwards it to the remote server. The remote server returns the result to the proxy server, which in turn passes it back to the local client. Two commonly cited reasons for this are security (the details of the local network are hidden from remote hosts) and performance (frequently requested resources can be cached by the proxy).

In Java, apart from TCP connections going through a transport-layer SOCKS proxy, other application-layer proxies are not supported. Java provides no configuration option to switch the proxy off for an individual socket, but a proxy can be enabled and restricted through the following three system properties:

http.proxyHost: the host name of the proxy server. Not set by default.
http.proxyPort: the port number of the proxy server. If it is not set, port 80 is used.
http.nonProxyHosts: host names that should be reached without the proxy; multiple entries are separated by a vertical bar (|). The default is localhost|127.*|[::1].

For example:
System.setProperty("http.proxyHost","localhost");
System.setProperty("http.proxyPort","1080");
System.setProperty("http.nonProxyHosts","www.baidu.com|github.com");

Proxy class

The java.net.Proxy class gives finer-grained control over proxy servers: it lets a program choose different proxy servers for different remote hosts, rather than configuring one proxy globally through system properties. Proxy currently supports three proxy types: Proxy.Type.DIRECT (a direct connection without a proxy), Proxy.Type.HTTP (an HTTP proxy) and Proxy.Type.SOCKS (a SOCKS V4 or V5 proxy).

Use as follows:

SocketAddress socketAddress = new InetSocketAddress("localhost",80);
Proxy proxy = new Proxy(Proxy.Type.HTTP,socketAddress);
Socket socket = new Socket(proxy);
//...

ProxySelector class

Every running Java virtual machine has a java.net.ProxySelector instance that determines which proxy server each connection uses. The default implementation of java.net.ProxySelector is an instance of sun.net.spi.DefaultProxySelector. It examines the various system properties and the protocol of the URL and then decides how to connect to the different remote hosts, through a proxy or directly. Developers can also extend java.net.ProxySelector with a custom implementation, so that different proxy servers are chosen according to other criteria such as protocol, host, path or time of day. The core methods of java.net.ProxySelector are as follows:

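//Returns the default ProxySelector instance
public static ProxySelector getDefault()

//Sets the default ProxySelector instance
public static void setDefault(ProxySelector ps)

//Returns the list of proxies usable for the given URI
public abstract List<Proxy> select(URI uri)

//Tells the ProxySelector that a proxy could not be reached, together with the exception that occurred
public abstract void connectFailed(URI uri,SocketAddress sa,IOException ioe)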

If you write your own implementation, it is a good idea to cache the list of available proxies and, once a proxy becomes unavailable, to remove it from that list through connectFailed.
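A minimal sketch of such a selector is shown below. The class name FailoverProxySelector and the proxy host names are assumptions; a real implementation would probably also let failed proxies recover after some time.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.SocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical selector: keeps a fixed candidate list and skips proxies reported as failed.
final class FailoverProxySelector extends ProxySelector {

    private final List<Proxy> candidates;
    private final Set<SocketAddress> failed = ConcurrentHashMap.newKeySet();

    FailoverProxySelector(List<Proxy> candidates) {
        this.candidates = new ArrayList<>(candidates);
    }

    @Override
    public List<Proxy> select(URI uri) {
        if (uri == null) {
            throw new IllegalArgumentException("uri must not be null");
        }
        List<Proxy> usable = new ArrayList<>();
        for (Proxy proxy : candidates) {
            if (!failed.contains(proxy.address())) {
                usable.add(proxy);
            }
        }
        // Fall back to a direct connection when every proxy has failed.
        if (usable.isEmpty()) {
            usable.add(Proxy.NO_PROXY);
        }
        return usable;
    }

    @Override
    public void connectFailed(URI uri, SocketAddress sa, IOException ioe) {
        if (uri == null || sa == null || ioe == null) {
            throw new IllegalArgumentException("arguments must not be null");
        }
        failed.add(sa); // remember the unreachable proxy address
    }

    public static void main(String[] args) {
        Proxy a = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved("proxy-a.example", 1080));
        Proxy b = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved("proxy-b.example", 1080));
        FailoverProxySelector selector = new FailoverProxySelector(Arrays.asList(a, b));
        URI target = URI.create("http://example.com/");
        System.out.println(selector.select(target).size()); // 2
        selector.connectFailed(target, a.address(), new IOException("proxy down"));
        System.out.println(selector.select(target).size()); // 1
    }
}
```

Calling ProxySelector.setDefault(selector) would install it for the whole JVM; the sketch leaves the default untouched.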

Summary

URLs and URIs are important identifiers of resources in today's networked systems, and understanding their basics makes network programming easier. A URI is a Uniform Resource Identifier and can denote a resource in any medium. A URL is a Uniform Resource Locator; it generally denotes the location of a network resource on the Internet and is most often seen with the HTTP or HTTPS protocol. URIs clearly cover a wider range: URLs are in fact a subset of URIs. The differences between the two are described in the sections above.

(c-5-d e-20181003)
