Java – word co-occurrence in sentences

I have a large set of sentences (10,000) in a file. The file contains one sentence per line. Across the whole set, I want to find which words occur together in a sentence, and how often.

Example sentences:

"Proposal 201 has been accepted by the Chief today.",
"Proposal 214 and 221 are accepted,as per recent Chief decision",
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
"Proposal 214,ValueMania,has been accepted by the Chief."

I want to produce the following output. I should be able to provide three starting words as program arguments: "Chief, accepted, Proposal"

Chief accepted Proposal            5
Chief accepted Proposal has        3
Chief accepted Proposal has been   3

... 
...
for all combinations.

I know the number of combinations can be large.

I searched the Internet but couldn't find anything. I wrote some code, but I couldn't figure it out. Maybe someone who knows the domain can help.

ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();

try {
    String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
    for (String t : tmp) {
        // Split the sentence into words on single spaces
        String[] keys = t.split(" ");
        int count = 0;
        System.out.println(t);
        String[] uniqueKeys = getUniqueKeys(keys);
        for (String key : uniqueKeys) {
            // Unused slots at the end of uniqueKeys are null
            if (null == key) {
                break;
            }
            // Count how often this word occurs in the current sentence
            for (String s : keys) {
                if (key.equals(s)) {
                    count++;
                }
            }
            System.out.println("Count of [" + key + "] is : " + count);
            count = 0;
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}

// Returns the distinct words of keys in first-seen order; unused trailing slots stay null
private static String[] getUniqueKeys(String[] keys) {
    String[] uniqueKeys = new String[keys.length];

    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;
    boolean keyAlreadyExists = false;

    for (int i = 1; i < keys.length; i++) {
        // Only the first uniqueKeyIndex slots are filled, so compare against those
        for (int j = 0; j < uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
            }
        }

        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
        keyAlreadyExists = false;
    }
    return uniqueKeys;
}

Can anyone help with the code?

Solution

You can apply standard information retrieval data structures, in particular an inverted index. Here is how to do it.

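Consider your original sentences. Number them with integer identifiers, as follows:

1. "Proposal 201 has been accepted by the Chief today."
2. "Proposal 214 and 221 are accepted,as per recent Chief decision"
3. "This proposal has been accepted by the Chief."
4. "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief."
5. "Proposal 214,ValueMania,has been accepted by the Chief."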

For each pair of words you encounter in a sentence, add the pair to an inverted index that maps it to a set (a collection of unique items) of sentence identifiers. A sentence of N words has N-choose-2 pairs, that is, N(N-1)/2.

The appropriate Java data structure would be Map<String, Map<String, Set<Integer>>>. Order the two words of each pair alphabetically (ignoring case), so that the pair "has" and "Proposal" appears only as ("has", "Proposal") and never as ("Proposal", "has").

This map will contain the following:

"has","Proposal" --> Set(1,5)
"accepted","Proposal" --> Set(1,2,5)
"accepted","has" --> Set(1,3,5)
etc.

For example, the word pair "has" and "Proposal" has the set (1,5), meaning that both words are found together in sentences 1 and 5.
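As a rough illustration of building such an index, here is a minimal sketch. The class name PairIndex, the helper distinctWords, and the tokenization rule (words are runs of ASCII letters or digits, compared case-sensitively) are assumptions made for this example and are not part of the original answer. Case-sensitive matching is chosen only because it reproduces the sets shown above; lower-casing every token first would instead treat "Proposal" and "proposal" as the same word.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PairIndex {

    // The five example sentences from the question; list position + 1 is the sentence ID.
    static final List<String> SENTENCES = List.of(
        "Proposal 201 has been accepted by the Chief today.",
        "Proposal 214 and 221 are accepted,as per recent Chief decision",
        "This proposal has been accepted by the Chief.",
        "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
        "Proposal 214,ValueMania,has been accepted by the Chief.");

    // Orders the two words of a pair alphabetically, ignoring case; the
    // case-sensitive order is only a tie-breaker to keep the result deterministic.
    static final Comparator<String> PAIR_ORDER =
        String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Assumption: a word is a run of ASCII letters or digits, kept case-sensitive.
    static List<String> distinctWords(String sentence) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : sentence.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return new ArrayList<>(words);
    }

    // Maps first word -> second word -> IDs of the sentences that contain both.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int id = 1; id <= sentences.size(); id++) {
            List<String> words = distinctWords(sentences.get(id - 1));
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    String a = words.get(i);
                    String b = words.get(j);
                    boolean inOrder = PAIR_ORDER.compare(a, b) <= 0;
                    index.computeIfAbsent(inOrder ? a : b, k -> new HashMap<>())
                         .computeIfAbsent(inOrder ? b : a, k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Set<Integer>>> index = build(SENTENCES);
        System.out.println(index.get("has").get("Proposal"));      // [1, 5]
        System.out.println(index.get("accepted").get("Proposal")); // [1, 2, 5]
        System.out.println(index.get("accepted").get("has"));      // [1, 3, 5]
    }
}
```

With 10,000 sentences the nested maps can grow large, since every sentence of N distinct words contributes N(N-1)/2 entries.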

Now suppose you want to find the number of co-occurrences of the words in the list "accepted", "has", and "Proposal". Generate all pairs from this list and intersect their respective sets (using Java's Set.retainAll()). The result here ends up being the set (1,5). Its size is 2, which means that two sentences contain "accepted", "has", and "Proposal".
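The lookup step might be sketched as follows. The class and method names (CooccurrenceLookup, sentencesWithAll) are hypothetical, and the index in main is a hand-written excerpt mirroring the example entries in the text rather than one built from a file. The sketch assumes pairs were stored in case-insensitive alphabetical order.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CooccurrenceLookup {

    // Must match the ordering that was used when the pairs were stored in the index.
    static final Comparator<String> PAIR_ORDER =
        String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Intersects the sentence-ID sets of every pair that can be formed from words.
    // With fewer than two words there are no pairs, so the result is empty.
    static Set<Integer> sentencesWithAll(Map<String, Map<String, Set<Integer>>> index,
                                         List<String> words) {
        Set<Integer> result = null;
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                String a = words.get(i);
                String b = words.get(j);
                boolean inOrder = PAIR_ORDER.compare(a, b) <= 0;
                Map<String, Set<Integer>> inner = index.getOrDefault(inOrder ? a : b, Map.of());
                Set<Integer> ids = inner.getOrDefault(inOrder ? b : a, Set.of());
                if (result == null) {
                    result = new TreeSet<>(ids);   // copy, so the index itself is never modified
                } else {
                    result.retainAll(ids);         // keep only IDs present in both sets
                }
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        // Hand-written excerpt of the index, mirroring the example entries in the text.
        Map<String, Map<String, Set<Integer>>> index = Map.of(
            "has", Map.of("Proposal", Set.of(1, 5)),
            "accepted", Map.of("Proposal", Set.of(1, 2, 5), "has", Set.of(1, 3, 5)));

        Set<Integer> ids = sentencesWithAll(index, List.of("accepted", "has", "Proposal"));
        System.out.println(ids + " -> count " + ids.size()); // [1, 5] -> count 2
    }
}
```

Copying the first set before calling retainAll() matters, because retainAll() modifies the set it is called on and would otherwise corrupt the index.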

To generate all pairs, simply iterate over the map as needed. To generate all word tuples of size N, you will need to iterate and use recursion as needed.
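One possible way to generate the tuples recursively is sketched below. The names WordTuples, tuples, and extend are invented for this example, and the word list in main is an arbitrary sample. Each generated tuple could then be looked up by intersecting the sets of its pairs.

```java
import java.util.ArrayList;
import java.util.List;

class WordTuples {

    // Returns every combination of n words from the list, preserving list order.
    static List<List<String>> tuples(List<String> words, int n) {
        List<List<String>> out = new ArrayList<>();
        extend(words, n, 0, new ArrayList<>(), out);
        return out;
    }

    // Recursively extends current with words taken from position start onwards.
    private static void extend(List<String> words, int n, int start,
                               List<String> current, List<List<String>> out) {
        if (current.size() == n) {
            out.add(new ArrayList<>(current)); // store a copy; current is reused
            return;
        }
        for (int i = start; i < words.size(); i++) {
            current.add(words.get(i));
            extend(words, n, i + 1, current, out);
            current.remove(current.size() - 1); // backtrack before trying the next word
        }
    }

    public static void main(String[] args) {
        List<String> words = List.of("accepted", "been", "has", "Proposal");
        for (List<String> tuple : tuples(words, 3)) {
            System.out.println(tuple);
        }
    }
}
```

The number of tuples grows combinatorially with the vocabulary size, so in practice it helps to stop extending a tuple as soon as its sentence set becomes empty.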
