What is the difference between the getHost and getAuthority methods of the URL class in Java?
I have a set of strings (URLs) in several different forms:
> http://domainname.anything/anypath
> https://domainname.anything/anypath
> http://www.domainname.anything/anypath
> https://www.domainname.anything/anypath
These strings are stored in a CSV file. I need to parse every URL to get only the domain name, that is, everything after the // and before the first / that follows it.
I use the split method to separate the strings, convert each string to a URL, and then call getAuthority() to get only the domain name. The problem is that, for me, getAuthority() and getHost() do the same thing: both include the www that I don't want. In the Oracle tutorial, however, it looks as if getAuthority() should return the domain name without www.
How can I extract the domain name without www from a URL?
Solution
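To really understand this, you should read the URI specification, RFC 2396.
The short answer is that the authority component consists of the host component plus an optional port number and an optional user name and password, depending on the URL scheme used. For a URL that has no port or user information, getAuthority() and getHost() therefore return the same string.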
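The difference only shows up when the URL carries a port or user information. Here is a minimal sketch of that, assuming made-up example.com URLs; the class and method names (AuthorityDemo, parse, hostAndAuthority) are hypothetical and chosen only for illustration:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class AuthorityDemo {

    // Parses the text into a java.net.URL; invalid input becomes an unchecked exception.
    static URL parse(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + text, e);
        }
    }

    // Returns "host | authority" so the two accessors can be compared side by side.
    static String hostAndAuthority(String text) {
        URL url = parse(text);
        return url.getHost() + " | " + url.getAuthority();
    }

    public static void main(String[] args) {
        // No port or user info: host and authority are identical.
        System.out.println(hostAndAuthority("http://www.example.com/anypath"));
        // Port and user info (illustrative values): only the authority includes them.
        System.out.println(hostAndAuthority("http://user:secret@www.example.com:8080/anypath"));
    }
}
```

In both cases the www stays in place, because it is part of the host name itself and neither accessor treats it specially.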
To drop the www, call getHost() and test whether the result starts with the string "www."; if it does, remove that prefix.
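That check could be sketched as follows. The helper name hostWithoutWww and the example.org URLs are hypothetical. The sketch matches the prefix "www." including the dot, so that a host such as wwwfoo.example.org is left alone, and it ignores case on the assumption that host names are case-insensitive:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class HostWithoutWww {

    private static final String WWW_PREFIX = "www.";

    // Returns the host of the given URL text, minus a leading "www." if present.
    static String hostWithoutWww(String text) {
        final URL url;
        try {
            url = URI.create(text.trim()).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + text, e);
        }
        String host = url.getHost();
        // Case-insensitive prefix test; the dot is part of the prefix on purpose.
        if (host.regionMatches(true, 0, WWW_PREFIX, 0, WWW_PREFIX.length())) {
            return host.substring(WWW_PREFIX.length());
        }
        return host;
    }

    public static void main(String[] args) {
        System.out.println(hostWithoutWww("https://www.example.org/anypath")); // example.org
        System.out.println(hostWithoutWww("http://example.org/anypath"));      // example.org
    }
}
```

The result is meant as a label for the site, for example for grouping rows of the CSV file, not as a host name to connect to.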
But before you start doing this, you need to understand that deleting "www." may give you an invalid URL, or one that resolves to a different document or service than the original URL did. Rewriting URLs arbitrarily is a bad idea unless you know more about how the website is organized.
That "foo.com" and "www.foo.com" point to the same site is just a convention, and many websites do not implement it. Deleting "www." is therefore risky, because it might turn a resolvable URL into an unresolvable one.