采用Nagao算法统计各个子字符串的频次,然后基于这些频次统计每个字符串的词频、左右邻个数、左右熵、交互信息(内部凝聚度)。
名词解释:
Nagao算法:一种快速的统计文本里所有子字符串频次的算法。详细算法可见http://.algo.word; public class Main { public static void main(String[] args) { //if 3 arguments, first argument is input files splitting with ',' //second argument is output file //output 7 columns split with ',' , like below: //word, term frequency, left neighbor number, right neighbor number, left neighbor entropy, right neighbor entropy, mutual information //third argument is stop words list if(args.length == 3) NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2]); //if 4 arguments, forth argument is the NGram parameter N //5th argument is threshold of output words, default is "20,3,3,5" //output TF > 20 && (left | right) neighbor number > 3 && MI > 5 else if(args.length == 5) NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2], Integer.parseInt(args[3]), args[4]); } }
以上所述就是本文的全部内容了,希望大家能够喜欢。