Java – Splitting a paragraph into individual sentences
Am I covering all the bases here?
I'm trying to split a string containing multiple sentences into a string array of single sentences.
This is what I have so far:
    String input = "Hello World. "
            + "Today in the U.S.A.,it is a nice day! "
            + "Hurrah!"
            + "Here it comes... "
            + "Party time!";
    String array[] = input.split("(?<=[.?!])\\s+(?=[\\D\\d])");
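This code works fairly well. The output is:

    Hello World.
    Today in the U.S.A.,it is a nice day!
    Hurrah!Here it comes...
    Party time!

Note that "Hurrah!" and "Here it comes..." stay together, because the input has no whitespace between them for the regex to split on.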
I use a lookbehind to check whether sentence-ending punctuation precedes one or more whitespace characters. If so, the string is split at that whitespace.
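To illustrate why the lookbehind matters, here is a minimal sketch comparing it with a plain split on the punctuation itself. The class and method names (`LookbehindSplitDemo`, `splitConsuming`, `splitWithLookbehind`) and the sample text are made up for this example. The lookahead `(?=[\\D\\d])` matches any character, so it only requires that something follows the whitespace.

```java
import java.util.Arrays;

class LookbehindSplitDemo {
    // Hypothetical helper: the delimiter includes the punctuation,
    // so split() consumes it and the sentences lose their endings.
    static String[] splitConsuming(String text) {
        return text.split("[.?!]\\s+");
    }

    // The regex from the question: the lookbehind only asserts that
    // punctuation precedes the whitespace, so only the whitespace is consumed.
    static String[] splitWithLookbehind(String text) {
        return text.split("(?<=[.?!])\\s+(?=[\\D\\d])");
    }

    public static void main(String[] args) {
        String text = "Hello World. Is it nice? Party time!";
        // prints [Hello World, Is it nice, Party time!]
        System.out.println(Arrays.toString(splitConsuming(text)));
        // prints [Hello World., Is it nice?, Party time!]
        System.out.println(Arrays.toString(splitWithLookbehind(text)));
    }
}
```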
However, this regular expression does not cover some exceptions. For example, "The U.S. is a great country." is wrongly split into "The U.S." and "is a great country."
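The abbreviation problem can be reproduced with a short sketch. The class name `AbbreviationSplitDemo`, the constant `SENTENCE_BOUNDARY`, and the sample text are assumptions for illustration; the regex is the one from the snippet above.

```java
class AbbreviationSplitDemo {
    // Same pattern as in the question
    static final String SENTENCE_BOUNDARY = "(?<=[.?!])\\s+(?=[\\D\\d])";

    static String[] split(String text) {
        return text.split(SENTENCE_BOUNDARY);
    }

    public static void main(String[] args) {
        String text = "The U.S. is a great country. Party time!";
        // The period that ends "U.S." is followed by whitespace,
        // so the regex cannot tell it apart from a sentence end.
        for (String part : split(text)) {
            System.out.println("[" + part + "]");
        }
        // prints:
        // [The U.S.]
        // [is a great country.]
        // [Party time!]
    }
}
```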
Any ideas on how to solve this problem?
And did I miss any edge cases here?
Solution
If you don't have to use regular expressions, you can use Java's built-in BreakIterator.
The following code shows an example of splitting text into sentences, but BreakIterator also supports other kinds of boundaries (word, line, etc.). If you work with different languages, you can also pass in a specific locale. This example uses the default locale.
    import java.text.BreakIterator;

    String input = "Hello World. "
            + "Today in the U.S.A.,it is a nice day! "
            + "Hurrah!"
            + "The U.S. is a great country. "
            + "Here it comes... "
            + "Party time!";

    // Sentence iterator for the default locale
    BreakIterator iterator = BreakIterator.getSentenceInstance();
    iterator.setText(input);

    // Walk consecutive boundaries; each [start, end) pair delimits one sentence
    int start = iterator.first();
    for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
        System.out.println(input.substring(start, end));
    }
This results in the following output:

    Hello World.
    Today in the U.S.A.,it is a nice day!
    Hurrah!
    The U.S. is a great country.
    Here it comes...
    Party time!
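Since the question asks for a string array, the same loop can be wrapped in a small helper. This is only a sketch under a few assumptions: the names `SentenceSplitter` and `splitSentences` are made up here, `Locale.US` is an arbitrary choice for the example, and trimming each sentence is a design decision (the iterator's ranges include the trailing whitespace).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    // Hypothetical helper: collects the sentences found by BreakIterator
    // into an array, using an explicitly chosen locale.
    static String[] splitSentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            // Drop the whitespace that separates this sentence from the next one
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String input = "The U.S. is a great country. Here it comes... Party time!";
        for (String sentence : splitSentences(input, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```

Passing the locale as a parameter keeps the helper usable for text in other languages without changing the loop.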