Java: I have a large string of HTML and need to extract the href="..." text

I have a string that contains a chunk of HTML, and I am trying to extract the link from the href="..." part of it. The href can appear in one of the following forms:

<a href="..." />
<a class="..." href="..." />

I don't usually have trouble with regular expressions, but for some reason the following code does not do what I expect:

String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
    // Get all groups for this match
    for (int i = 0; i <= m.groupCount(); i++) {
        String groupStr = m.group(i);
        System.out.println(groupStr);
    }
}

Can someone tell me what's wrong with my code? I have done this kind of thing in PHP, but in Java I am doing something wrong. Whenever I try to print the match, it prints the whole HTML string.

Edit: so everyone knows what kind of string I'm dealing with:

<a class="Wrap" href="item.PHP?id=43241"><input type="button">
    <span class="chevron"></span>
  </a>
  <div class="menu"></div>

Every time I run the code, it prints the entire string. That's the problem.

As for using JTidy: I am using it, but I would still like to know what went wrong in this case.

Solution

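.*

This quantifier is greedy: it matches as many characters as it can, quotation marks included (and, with Pattern.DOTALL, line breaks too). The capture therefore runs from the first href=" to the last " in the input instead of stopping at the end of the attribute value.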
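
To see the greedy behaviour in isolation, here is a minimal sketch. The class name GreedyDemo, the helper firstGroup, and the shortened sample string are made up for illustration. The reluctant form .*? is included only for contrast and is not something the question used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Hypothetical helper: returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample input, shortened from the markup in the question.
        String html = "<a class=\"Wrap\" href=\"item.PHP?id=43241\"><span class=\"chevron\"></span></a>";

        // Greedy: .* keeps going until the LAST quote in the input.
        System.out.println(firstGroup("href=\"(.*)\"", html));
        // prints item.PHP?id=43241"><span class="chevron

        // Reluctant: .*? stops at the first quote after href=".
        System.out.println(firstGroup("href=\"(.*?)\"", html));
        // prints item.PHP?id=43241
    }
}
```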

Try a negated character class instead, so that the capture stops at the first closing quote:

"href=\"([^\"]*)\""
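
Putting the pattern href=\"([^\"]*)\" to work, a rough sketch of a complete extraction might look like the following. The class name HrefExtractor and the method extractHrefs are hypothetical, and the input is the sample markup from the question. It loops with while (m.find()) so that every link is collected, and reads group 1 rather than group 0 so that only the text between the quotes is returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Negated character class: the capture cannot run past the closing quote.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: collects the value of every href="..." in the input.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // keep searching until no more matches remain
            links.add(m.group(1));  // group 1 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup taken from the question.
        String html = "<a class=\"Wrap\" href=\"item.PHP?id=43241\"><input type=\"button\">\n"
                + "    <span class=\"chevron\"></span>\n"
                + "  </a>\n"
                + "  <div class=\"menu\"></div>";
        System.out.println(extractHrefs(html)); // prints [item.PHP?id=43241]
    }
}
```

This sketch only handles double-quoted values written exactly as href="..." with no spaces around the equals sign. For arbitrary real-world HTML, a parser such as the JTidy setup mentioned in the question is the more robust route.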