import nltk
from nltk.tokenize import sent_tokenize

text = "Don't hesitate to ask questions. Be positive."
print(sent_tokenize(text))
2.2 Word Tokenization Methods
TreebankWordTokenizer: contractions are split apart (nltk.word_tokenize uses a Treebank-style tokenizer internally):
words = nltk.word_tokenize(text)
print(words)
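As a minimal sketch (reusing the text variable from above; not part of the original), calling TreebankWordTokenizer directly makes the contraction splitting visible:

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# "Don't" is split into "Do" and "n't"
print(tokenizer.tokenize(text))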
WordPunctTokenizer splits the text by separating punctuation into its own tokens, and every word is preserved:
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text)
print(words)
2.3 Stop Words
Words that carry little useful information are called stop words.
Consider the following example:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)
2.4 Corpora
NLTK ships with a number of built-in corpora, and the nltk.corpus module provides a common interface for working with them, as sketched below.
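A minimal sketch of common corpus operations, assuming the gutenberg corpus has been downloaded (e.g. via nltk.download('gutenberg')):

from nltk.corpus import gutenberg

# list the files available in the corpus
print(gutenberg.fileids())

# access one text as a raw string, a word list, or a sentence list
raw = gutenberg.raw('austen-emma.txt')
words = gutenberg.words('austen-emma.txt')
sents = gutenberg.sents('austen-emma.txt')
print(len(words), len(sents))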
2.5 Stemming
The reason for stemming is to shorten lookup time and to normalize sentences.
Consider the following case:
I was taking a ride in the car.
I was riding in the car.
Both sentences express the same meaning (I was in the car), so there is no need to distinguish taking from riding, as the sketch below shows.
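A quick illustration (an assumed example, not from the original) of how the Porter stemmer collapses both forms:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# both inflected forms reduce to the same base verbs
print(ps.stem("taking"), ps.stem("riding"))  # take ride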
The following stems a group of related words:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))
The following stems the words of a real-world sentence:
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w), end=" ")
2.6 Lemmatization
Lemmatization is similar to stemming; the difference is that stemming can create words that do not exist, while lemmatization always yields a real word.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))  # output: increase

print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
'''
Output:
play
playing
playing
playing
'''
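To see the difference from stemming concretely, here is a small side-by-side comparison (an assumed example, not from the original):

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["geese", "studies", "cats"]:
    # the stemmer can produce non-words such as 'gees' and 'studi',
    # while the lemmatizer maps each word to a real dictionary form
    print(w, "->", ps.stem(w), "vs", lemmatizer.lemmatize(w))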
2.7 Part-of-Speech Tagging
The Penn Treebank tag set used by nltk.pos_tag:

CC    coordinating conjunction
CD    cardinal digit
DT    determiner
EX    existential there (like: "there is" ... think of it like "there exists")
FW    foreign word
IN    preposition/subordinating conjunction
JJ    adjective 'big'
JJR   adjective, comparative 'bigger'
JJS   adjective, superlative 'biggest'
LS    list marker 1)
MD    modal could, will
NN    noun, singular 'desk'
NNS   noun plural 'desks'
NNP   proper noun, singular 'Harrison'
NNPS  proper noun, plural 'Americans'
PDT   predeterminer 'all the kids'
POS   possessive ending parent's
PRP   personal pronoun I, he, she
PRP$  possessive pronoun my, his, hers
RB    adverb very, silently
RBR   adverb, comparative better
RBS   adverb, superlative best
RP    particle give up
TO    to go 'to' the store
UH    interjection errrrrrrrm
VB    verb, base form take
VBD   verb, past tense took
VBG   verb, gerund/present participle taking
VBN   verb, past participle taken
VBP   verb, sing. present, non-3d take
VBZ   verb, 3rd person sing. present takes
WDT   wh-determiner which
WP    wh-pronoun who, what
WP$   possessive wh-pronoun whose
WRB   wh-adverb where, when
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Assumed setup (not shown in the excerpt above): train a Punkt sentence
# tokenizer and use it to build the 'tokenized' sentence list.
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

# tagging function
def process_content():
    try:
        for i in tokenized[:5]:
            # word tokenization
            words = nltk.word_tokenize(i)
            # POS tagging
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
2.8 Text Classification

import nltk
import random
from nltk.corpus import movie_reviews

# For each category (positive and negative), take every file ID, store a
# word-tokenized version of the file, and attach the positive/negative label.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# The first part of the data is all negative and the rest all positive,
# so shuffle it.
random.shuffle(documents)

print(documents[1])

all_words = []
for w in movie_reviews.words():
    # lowercase every word
    all_words.append(w.lower())

# Count each word's occurrences: in a FreqDist the keys are words and the
# values are total occurrence counts.
# (More on FreqDist: https://blog.csdn.net/csdn_lzw/article/details/80390768)
all_words = nltk.FreqDist(all_words)

# the 15 most frequent words
print(all_words.most_common(15))

# the count of a specific word
print(all_words["stupid"])
Next, we will begin storing these words as features of either positive or negative movie reviews.
2.8.1 Converting Words to Features with NLTK
Building on the code from the previous section:
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

# Use the most common words as the feature vocabulary; find_features
# below needs word_features, which the original excerpt omitted.
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        # mark whether each feature word appears in the document
        features[w] = (w in words)
    return features

# now we can print out the feature set
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
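A natural next step (a sketch following the same pattern; not part of the original excerpt) is to build a labeled feature set for every document, ready to be split into training and testing data:

# (features, label) pairs for every review in 'documents'
featuresets = [(find_features(rev), category) for (rev, category) in documents]
print(len(featuresets))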