Using Stanford's POS tagger in Java
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
These are the errors I get when I try to assign POS tags to sentences. I read the sentences from a file. At first (for the first few words) I don't get the error (i.e. "Untokenizable"), but after some sentences have been read, the error appears. I use v2.0 (i.e. 2009) of the tagger, and the model is left3words.
Solution
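I agree with Yuval that this is a character encoding problem, but the most common case is that the tagger is trying to read the file as UTF-8 while the file actually uses a single-byte encoding (such as ISO-8859-1). Bytes that are not valid UTF-8 are then decoded as the replacement character U+FFFD, which the tokenizer reports as untokenizable. See Wikipedia's discussion of U+FFFD.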
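To see the effect in isolation, here is a minimal sketch that uses only the standard Java library, not the tagger itself. The class name EncodingDemo, its helper methods, and the sample text are made up for illustration. It decodes the same ISO-8859-1 bytes twice: once as UTF-8, which produces U+FFFD, and once with the correct charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the Stanford tagger.
class EncodingDemo {
    // Decode raw bytes with the given charset. The String constructor
    // replaces malformed byte sequences with U+FFFD instead of failing.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // True if the text contains the Unicode replacement character U+FFFD.
    static boolean hasReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Assumed sample input: "café ok" stored as ISO-8859-1,
        // so the accented letter is the single byte 0xE9.
        byte[] raw = "caf\u00E9 ok".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong charset: 0xE9 followed by a space is not valid UTF-8.
        String wrong = decode(raw, StandardCharsets.UTF_8);
        // Correct charset: every byte maps to exactly one character.
        String right = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decode has U+FFFD:      " + hasReplacementChar(wrong));
        System.out.println("ISO-8859-1 decode has U+FFFD: " + hasReplacementChar(right));
    }
}
```

If this is the cause, one possible fix is to open the input file with an explicit charset, for example Files.newBufferedReader(path, StandardCharsets.ISO_8859_1) from java.nio.file, and pass the decoded text to the tagger. Converting the file to UTF-8 beforehand is another option. Which encoding the file really uses has to be checked; ISO-8859-1 here is only an assumption.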