A Keyword-Based Text Ranking and Retrieval System

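Contents

1. Problem Description
2. Requirements Analysis
3. Implementing the TF-IDF Model

(1) Approach
(2) Code

(2.1) Computing TF
(2.2) Computing IDF
(2.3) Computing TF-IDF

4. The Main Program
5. Supporting Functions

(1) Loading the corpus
(2) Processing the corpus

(2.1) Tokenization and stop words

(2.1.1) Tokenization
(2.1.2) Stop words

(2.2) Core of the data processing
(2.3) Complete source of dealDataSet()

(3) Exporting results to a text file

6. Source Code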

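1. Problem Description

2. Requirements Analysis

The first step is to understand what the TF-IDF model is. TF-IDF is an algorithm for computing a weight that measures how important (how relevant) a keyword is to a given document, so the documents in a corpus can be ranked by their tf-idf value for a specified keyword. For convenience I originally planned to use functions from Python's nltk library for tokenization and weight computation, but one of nltk's resources failed to download, so I implemented tf-idf myself. Stop words still had to be handled, so in the end I did use the stopwords corpus from nltk.corpus.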

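3. Implementing the TF-IDF Model

(1) Approach

My first implementation computed the tf-idf of every word in the corpus. That is harmless for a small corpus, but real corpora are large, and the values for words the user never searches for are wasted work that inflates both running time and memory use. So I changed the approach. In the final design, the user enters a search word, and the program computes that word's tf value in each document and its idf value over the corpus, then multiplies them to get the tf-idf values. Only the queried word is ever scored, which keeps retrieval efficient.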

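(2) Code

The implementation consists of three functions: one computes tf, one computes idf, and a third calls the first two to compute tf-idf. All three take the same parameters: in_word is the word being searched for, and words_num_dic is a dictionary that maps every document in the corpus to its word counts, with the structure {txt1: {word1: num1, word2: num2}, txt2: {word1: num3, word3: num4}, ...}.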
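
To make the shape of words_num_dic concrete, here is a small hand-built example. The document names and counts are invented for illustration and are not taken from the author's corpus.

```python
# A toy words_num_dic: outer keys are document names, inner dicts map word -> count.
# All names and numbers here are made up for illustration.
words_num_dic = {
    "a.txt": {"cat": 2, "sat": 1, "mat": 1},
    "b.txt": {"dog": 3, "sat": 1},
    "c.txt": {"bird": 1},
}

# Total number of counted words per document
totals = {name: sum(counts.values()) for name, counts in words_num_dic.items()}

# Documents that contain a given word
docs_with_sat = [name for name, counts in words_num_dic.items() if "sat" in counts]

print(totals)
print(docs_with_sat)
```

These two quantities, the per-document totals and the set of documents containing a word, are exactly what the tf and idf computations need.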

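(2.1) Computing TF

To compute tf, first count the total number of words in each document, then count how many times the search word occurs in each document, and divide the second by the first. The quotient is the word's TF in that document. The function returns a dictionary of the search word's tf value per document; documents that do not contain the word are omitted.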
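
Written as a formula (the notation is mine, not the author's), with $n_{w,d}$ the number of times word $w$ occurs in document $d$ after stop-word removal:

```latex
\mathrm{tf}(w, d) = \frac{n_{w,d}}{\sum_{k} n_{k,d}}
```

For example, a word that occurs 2 times in a document with 4 counted words has tf = 2/4 = 0.5.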

def computeTF(in_word, words_num_dic):
    """
    Compute the TF of in_word in each document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfDict: tf of in_word per document {filename1: tf1, filename2: tf2, ...}
    """
    allcount_dic = {}  # total word count of each document
    tfDict = {}  # tf values of in_word
    # Count the total number of words in each document
    for filename, num in words_num_dic.items():
        allcount_dic[filename] = sum(num.values())
    # Compute tf; documents that do not contain in_word are skipped
    for filename, num in words_num_dic.items():
        if in_word in num:
            tfDict[filename] = num[in_word] / allcount_dic[filename]
    return tfDict

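(2.2) Computing IDF

First count the total number of documents, then count the documents that contain the search word. Divide the former by the latter, with 1 added to the denominator so that it can never be 0, and return the base-10 logarithm of the quotient. A word's IDF depends only on the corpus as a whole. In other words, for a fixed corpus a word's IDF is fixed, so the function returns a single number rather than a dictionary.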

def computeIDF(in_word, words_num_dic):
    """
    Compute the idf of in_word.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: idf of in_word over the whole corpus
    """
    docu_count = len(words_num_dic)  # total number of documents
    count = 0  # number of documents that contain in_word
    for num in words_num_dic.values():
        if in_word in num:
            count += 1
    return math.log10(docu_count / (count + 1))  # +1 keeps the denominator non-zero
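
A quick numeric check of this formula, as a sketch: the helper idf_of below is a hypothetical stand-in that restates the same log10(N / (count + 1)) computation so the block runs on its own, and the three-document corpus is invented. One consequence of the +1 is worth seeing: a word found in all but one document gets idf 0, and a word found in every document gets a slightly negative idf.

```python
import math

def idf_of(word, words_num_dic):
    """Hypothetical stand-in: log10(N / (df + 1)), the formula described above."""
    n_docs = len(words_num_dic)
    df = sum(1 for counts in words_num_dic.values() if word in counts)
    return math.log10(n_docs / (df + 1))

# Invented toy corpus
corpus = {
    "a.txt": {"cat": 2, "sat": 1, "end": 1},
    "b.txt": {"dog": 3, "sat": 1, "end": 1},
    "c.txt": {"bird": 1, "end": 1},
}

rare = idf_of("cat", corpus)        # in 1 of 3 documents: log10(3 / 2) > 0
common = idf_of("sat", corpus)      # in 2 of 3 documents: log10(3 / 3) == 0
everywhere = idf_of("end", corpus)  # in 3 of 3 documents: log10(3 / 4) < 0
print(rare, common, everywhere)
```

Since tf is always positive, the sign of a document's tf-idf score follows the sign of the idf, which matters when the scores are later sorted in descending order.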

(2.3) Computing TF-IDF

This function calls the previous two and computes the word's tf-idf value in each document, returning a dictionary.

def computeTFIDF(in_word, words_num_dic):
    """
    Compute the tf-idf of in_word in each document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfidf_dic: tf-idf of in_word per document {filename1: tfidf1, filename2: tfidf2, ...}
    """
    tfidf_dic = {}
    idf = computeIDF(in_word, words_num_dic)
    tf_dic = computeTF(in_word, words_num_dic)
    for filename, tf in tf_dic.items():
        tfidf_dic[filename] = tf * idf
    return tfidf_dic
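
To see the three steps working together, here is a compact end-to-end sketch on an invented four-document corpus. The function tfidf_for_word is a hypothetical condensed restatement written so the block is self-contained; it is not the author's code.

```python
import math

def tfidf_for_word(word, words_num_dic):
    """Hypothetical compact version: tf * idf for each document containing word."""
    n_docs = len(words_num_dic)
    df = sum(1 for counts in words_num_dic.values() if word in counts)
    idf = math.log10(n_docs / (df + 1))
    return {
        name: counts[word] / sum(counts.values()) * idf
        for name, counts in words_num_dic.items()
        if word in counts
    }

# Invented toy corpus: "cat" occurs in two of the four documents
corpus = {
    "a.txt": {"cat": 3, "mat": 1},  # tf = 3/4
    "b.txt": {"cat": 1, "dog": 3},  # tf = 1/4
    "c.txt": {"bird": 2},
    "d.txt": {"fish": 5},
}

scores = tfidf_for_word("cat", corpus)
# Rank documents by score, highest first
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Both documents share the same idf, log10(4 / 3), so the ranking here is decided by tf alone: a.txt comes before b.txt.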

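4. The Main Program

The program first loads the contents of every document in the corpus and processes the loaded data (tokenization, stop-word removal, and so on), producing a dictionary of per-document word counts and a vocabulary for the whole corpus. Once the corpus is processed, the user enters one or more keywords. The input is tokenized into a list of keywords. For each keyword found in the corpus vocabulary, the program computes its tf-idf value in every document, prints the five highest-scoring documents in descending tf-idf order, and appends them to result1.txt. If none of the entered keywords is in the vocabulary, it prints "No search results". After each round the user is asked whether to continue: if so, the steps repeat and the output goes to result2.txt, and so on; otherwise the program exits.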

if __name__ == '__main__':
    # Load the corpus
    print("\tDefault corpus path: D:/study/B4/data")
    print("\tSearch results path: D:/study/B4/result")
    path = "D:/study/B4/data"  # corpus path
    all_docu_dic = loadDataSet(path)  # load the corpus into memory
    # Returns 1. the vocabulary (stop words removed), 2. per-document word counts
    words_set, words_num_dic = dealDataSet(all_docu_dic)
    n = 0  # number of search rounds
    a = ''  # loop control; input() returns a string, so compare against '0'
    while a != '0':
        in_words = input("Search: ")
        # Note: inside [...], "+-=" is a range ('+' to '='), so digits also split
        input_list = re.split("[!? '. ),(+-=。:]", in_words)
        k = 0  # number of valid keywords in this input
        n += 1
        for word in input_list:
            if word in words_set:
                k += 1
                tfidf_dic = computeTFIDF(word, words_num_dic)  # unsorted tf-idf dict
                top5 = sortOut(tfidf_dic)[0:5]  # five most relevant documents
                # Console output
                print("Keyword: " + word)
                print(top5)
                # File output: append the ranked list to this round's file
                text_save("result" + str(n) + ".txt", top5, word)
        if k == 0:
            print("No search results")
        a = input("Enter anything to search again, or '0' to quit: ")
        print("-------------------------------------")
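
One detail worth checking in a loop like this: input() always returns a string, so the exit test has to compare against the string '0' rather than the integer 0, otherwise the loop can never end. The skeleton below is a hypothetical sketch of just the loop control, with read_line standing in for input() so it can run without a user.

```python
def run_rounds(read_line):
    """Hypothetical loop skeleton: read_line stands in for input()."""
    rounds = 0
    answer = ""
    while answer != "0":  # compare with the string "0", not the int 0
        rounds += 1
        # ... one search round would run here ...
        answer = read_line()
    return rounds

# Simulate a user who continues twice and then enters 0
replies = iter(["", "y", "0"])
rounds = run_rounds(lambda: next(replies))
print(rounds)
```

Passing the reader in as a parameter is only a convenience for testing; in the interactive program it would simply be input.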

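5. Supporting Functions

(1) Loading the corpus

This part reads the corpus directory and loads whatever documents it finds there, passing their contents into the program. It returns a corpus dictionary all_docu_dic with the structure {name1: content1, name2: content2, ...}.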

def loadDataSet(path):
    """
    Read the documents in the corpus directory into a dictionary.

    :param path: corpus directory
    :return: corpus dictionary {name1: content1, name2: content2, ...}
    """
    files = os.listdir(path)  # names of all entries in the directory
    all_docu_dic = {}  # maps document name to document content
    for file in files:
        full_path = os.path.join(path, file)
        # Test the full path: a bare name would be resolved against the working directory
        if not os.path.isdir(full_path):  # skip subdirectories
            with open(full_path, encoding='UTF-8-sig') as f:
                strr = f.read()  # read the whole document
            all_docu_dic[file] = strr.strip('.')  # strip '.' characters from both ends
    print("Corpus:")
    print(all_docu_dic)
    return all_docu_dic
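
For comparison, the same idea can be sketched with pathlib. This is an alternative under my own assumptions, not the author's code: the name load_corpus is hypothetical, and the demo writes two throwaway files to a temporary directory so it does not depend on the D:/study/B4/data path.

```python
import tempfile
from pathlib import Path

def load_corpus(folder):
    """Hypothetical loader: map each regular file's name to its text."""
    corpus = {}
    for entry in sorted(Path(folder).iterdir()):
        if entry.is_file():  # skips subdirectories
            # utf-8-sig drops a leading byte-order mark if one is present
            corpus[entry.name] = entry.read_text(encoding="utf-8-sig")
    return corpus

# Demo on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("The cat sat.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("A dog ran.", encoding="utf-8")
    Path(tmp, "sub").mkdir()
    demo = load_corpus(tmp)

print(demo)
```

Because iterdir() yields full paths, the file-versus-directory test does not depend on the current working directory.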

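(2) Processing the corpus

The loading function loadDataSet() from the previous step yields the corpus as a dictionary. The next step is to process that data with dealDataSet().

(2.1) Tokenization and stop words

(2.1.1) Tokenization

Sentences are tokenized with the split function from the re module, which allows a custom set of separators.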

cut = re.split("[!? '.),(+-=。:]", content)  # tokenize
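
It helps to run this pattern on a sample sentence, because inside a character class the sequence +-= is read as a range from '+' to '=', which also covers the digits and the characters , - . / : ; and <. The sketch below compares the pattern as written with a variant in which the hyphen is escaped; the sample sentence is invented, and the escaped variant is my own suggestion rather than something from the article.

```python
import re

text = "Hello, world! It's 2024-05."

# Pattern as written in the article: "+-=" is a range inside the brackets
as_written = [w for w in re.split("[!? '.),(+-=。:]", text) if w]

# Variant with the hyphen escaped, so only the listed characters separate
escaped = [w for w in re.split(r"[!? '.),(+\-=。:]", text) if w]

print(as_written)
print(escaped)
```

With the pattern as written, the numbers vanish from the token list entirely, so numeric keywords cannot be searched. Whichever pattern is chosen, the corpus and the user's query should be split with the same one.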

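(2.1.2) Stop words

Stop-word handling uses the stopwords corpus from the nltk.corpus module. The commented-out lines also show how the stop-word list can be extended.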

stop_words = stopwords.words('english')  # built-in English stop-word list
# # Extending the stop-word list
# print(len(stop_words))
# extra_words = [' ']  # additional stop words
# stop_words.extend(extra_words)  # final stop-word list
# print(len(stop_words))
new_cut = [w for w in cut if w and w not in stop_words]  # drop stop words and the empty strings produced by split
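
The filter above is an exact, case-sensitive match. NLTK's English stop-word entries are lowercase, so a capitalized token such as "The" passes through. The sketch below illustrates this with a tiny stand-in set, used here only so the block runs without downloading the NLTK data; the lowercasing variant is an option to consider, not something the article does.

```python
# Stand-in for stopwords.words('english'); the real list is much longer
stop_words = {"the", "a", "is", "on"}

cut = ["The", "cat", "is", "on", "", "the", "mat"]

# Same filter as in the article: exact, case-sensitive match
case_sensitive = [w for w in cut if w and w not in stop_words]

# Variant that lowercases each token before the lookup
case_insensitive = [w.lower() for w in cut if w and w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Using a set for the stop words also makes each membership test constant-time, which adds up on a large corpus.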

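(2.2) Core of the data processing

This part takes the tokenized corpus and produces the corpus vocabulary all_words_set (structure: {word1, word2, ...}) and the per-document word-count dictionary words_num_dic (structure: {txt1: {word1: num1, word2: num2}, ...}).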

# Build the tokenized documents and the overall vocabulary
for filename, content in all_docu_dic.items():
    cut = re.split("[!? '.),(+-=。:]", content)  # tokenize
    new_cut = [w for w in cut if w and w not in stop_words]  # drop stop words and empty strings
    all_docu_cut[filename] = new_cut  # key: document name, value: list of tokens
    all_words.extend(new_cut)
all_words_set = set(all_words)  # convert to a set

# Count the words in each document
words_num_dic = {}
for filename, cut in all_docu_cut.items():
    words_num_dic[filename] = dict.fromkeys(all_docu_cut[filename], 0)
    for word in cut:
        words_num_dic[filename][word] += 1
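
The counting step can also be expressed with collections.Counter from the standard library. The sketch below is an equivalent alternative on invented data, not a change the article makes.

```python
from collections import Counter

# Invented tokenized documents, in the shape of all_docu_cut
all_docu_cut = {
    "a.txt": ["cat", "sat", "cat"],
    "b.txt": ["dog", "sat"],
}

# Counter does the fromkeys + increment loop in one step
words_num_dic = {name: dict(Counter(tokens)) for name, tokens in all_docu_cut.items()}

# The vocabulary is the union of all tokens
all_words_set = set().union(*all_docu_cut.values())

print(words_num_dic)
print(all_words_set)
```

The result has the same structure as before, so the tf and idf functions can consume it unchanged.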

(2.3) Complete source of dealDataSet()

def dealDataSet(all_docu_dic):
    """
    Process the corpus dictionary.

    :param all_docu_dic: corpus dictionary {name1: content1, name2: content2, ...}
    :return: 1. all_words_set: corpus vocabulary {word1, word2, ...}
             2. words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    """
    all_words = []
    all_docu_cut = {}  # tokenized documents (a dict of lists)

    stop_words = stopwords.words('english')  # built-in English stop-word list
    # # Extending the stop-word list
    # print(len(stop_words))
    # extra_words = [' ']  # additional stop words
    # stop_words.extend(extra_words)  # final stop-word list
    # print(len(stop_words))

    # Build the tokenized documents and the overall vocabulary
    for filename, content in all_docu_dic.items():
        cut = re.split("[!? '.),(+-=。:]", content)  # tokenize
        new_cut = [w for w in cut if w and w not in stop_words]  # drop stop words and empty strings
        all_docu_cut[filename] = new_cut  # key: document name, value: list of tokens
        all_words.extend(new_cut)
    all_words_set = set(all_words)  # convert to a set

    # Count the words in each document
    words_num_dic = {}
    for filename, cut in all_docu_cut.items():
        words_num_dic[filename] = dict.fromkeys(all_docu_cut[filename], 0)
        for word in cut:
            words_num_dic[filename][word] += 1
    # print("Vocabulary:")
    # print(all_words_set)
    print("Tokenized documents:")
    print(all_docu_cut)
    return all_words_set, words_num_dic  # vocabulary and per-document word counts

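(3) Exporting results to a text file

A dictionary cannot be written to a text file directly, so it needs extra handling first: any value that is not a string must be converted with str() before it can be written to the file.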

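def text_save(filename, data, word):
    """
    Append the ranked results for the search word to the file filename.

    :param filename: name of the output file
    :param data: ranked results as a list of (document name, tf-idf) tuples
    :param word: the keyword
    """
    # Files are written directly under D:/study/B4/; 'a' appends to existing content
    with open("D:/study/B4/" + filename, 'a', encoding='utf-8') as fp:
        fp.write("Keyword: " + str(word) + '\n')
        for line in data:
            for a in line:
                fp.write('\t' + str(a))  # str() converts the float score
            fp.write('\t')
        fp.write('\n')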
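
As a small illustration of the str() conversion, the sketch below formats a ranked list into text and appends it to a file. The helper format_results, the one-result-per-line layout, and the scores are my own assumptions for the example, and it writes to a temporary directory instead of a fixed drive path.

```python
import os
import tempfile

def format_results(word, ranked):
    """Hypothetical helper: one header line, then one tab-separated line per result."""
    lines = ["Keyword: " + str(word)]
    for name, score in ranked:
        lines.append("\t" + str(name) + "\t" + str(score))
    return "\n".join(lines) + "\n"

ranked = [("a.txt", 0.25), ("b.txt", 0.125)]  # invented scores
text = format_results("cat", ranked)

# Append to a file in a temporary directory rather than a fixed drive path
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "result1.txt")
    with open(out_path, "a", encoding="utf-8") as fp:
        fp.write(text)
    with open(out_path, encoding="utf-8") as fp:
        written = fp.read()

print(written)
```

Building the text first and writing it in one call keeps the formatting logic separate from the file handling, which makes the layout easy to check.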

6. Source Code

import math
import os
import re
from nltk.corpus import stopwords

def loadDataSet(path):
    """
    Read the documents in the corpus directory into a dictionary.

    :param path: corpus directory
    :return: corpus dictionary {name1: content1, name2: content2, ...}
    """
    files = os.listdir(path)  # names of all entries in the directory
    all_docu_dic = {}  # maps document name to document content
    for file in files:
        full_path = os.path.join(path, file)
        # Test the full path: a bare name would be resolved against the working directory
        if not os.path.isdir(full_path):  # skip subdirectories
            with open(full_path, encoding='UTF-8-sig') as f:
                strr = f.read()  # read the whole document
            all_docu_dic[file] = strr.strip('.')  # strip '.' characters from both ends
    print("Corpus:")
    print(all_docu_dic)
    return all_docu_dic

def dealDataSet(all_docu_dic):
    """
    Process the corpus dictionary.

    :param all_docu_dic: corpus dictionary {name1: content1, name2: content2, ...}
    :return: 1. all_words_set: corpus vocabulary {word1, word2, ...}
             2. words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    """
    all_words = []
    all_docu_cut = {}  # tokenized documents (a dict of lists)

    stop_words = stopwords.words('english')  # built-in English stop-word list
    # # Extending the stop-word list
    # print(len(stop_words))
    # extra_words = [' ']  # additional stop words
    # stop_words.extend(extra_words)  # final stop-word list
    # print(len(stop_words))

    # Build the tokenized documents and the overall vocabulary
    for filename, content in all_docu_dic.items():
        cut = re.split("[!? '.),(+-=。:]", content)  # tokenize
        new_cut = [w for w in cut if w and w not in stop_words]  # drop stop words and empty strings
        all_docu_cut[filename] = new_cut  # key: document name, value: list of tokens
        all_words.extend(new_cut)
    all_words_set = set(all_words)  # convert to a set

    # Count the words in each document
    words_num_dic = {}
    for filename, cut in all_docu_cut.items():
        words_num_dic[filename] = dict.fromkeys(all_docu_cut[filename], 0)
        for word in cut:
            words_num_dic[filename][word] += 1
    # print("Vocabulary:")
    # print(all_words_set)
    print("Tokenized documents:")
    print(all_docu_cut)
    return all_words_set, words_num_dic  # vocabulary and per-document word counts

def computeTF(in_word, words_num_dic):
    """
    Compute the TF of in_word in each document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfDict: tf of in_word per document {filename1: tf1, filename2: tf2, ...}
    """
    allcount_dic = {}  # total word count of each document
    tfDict = {}  # tf values of in_word
    # Count the total number of words in each document
    for filename, num in words_num_dic.items():
        allcount_dic[filename] = sum(num.values())
    # Compute tf; documents that do not contain in_word are skipped
    for filename, num in words_num_dic.items():
        if in_word in num:
            tfDict[filename] = num[in_word] / allcount_dic[filename]
    return tfDict

def computeIDF(in_word, words_num_dic):
    """
    Compute the idf of in_word.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: idf of in_word over the whole corpus
    """
    docu_count = len(words_num_dic)  # total number of documents
    count = 0  # number of documents that contain in_word
    for num in words_num_dic.values():
        if in_word in num:
            count += 1
    return math.log10(docu_count / (count + 1))  # +1 keeps the denominator non-zero

def computeTFIDF(in_word, words_num_dic):
    """
    Compute the tf-idf of in_word in each document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfidf_dic: tf-idf of in_word per document {filename1: tfidf1, filename2: tfidf2, ...}
    """
    tfidf_dic = {}
    idf = computeIDF(in_word, words_num_dic)
    tf_dic = computeTF(in_word, words_num_dic)
    for filename, tf in tf_dic.items():
        tfidf_dic[filename] = tf * idf
    return tfidf_dic

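def text_save(filename, data, word):
    """
    Append the ranked results for the search word to the file filename.

    :param filename: name of the output file
    :param data: ranked results as a list of (document name, tf-idf) tuples
    :param word: the keyword
    """
    # Files are written directly under D:/study/B4/; 'a' appends to existing content
    with open("D:/study/B4/" + filename, 'a', encoding='utf-8') as fp:
        fp.write("Keyword: " + str(word) + '\n')
        for line in data:
            for a in line:
                fp.write('\t' + str(a))  # str() converts the float score
            fp.write('\t')
        fp.write('\n')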

def sortOut(dic):
    """
    Sort the dictionary's items by value in descending order, keeping the values.

    :param dic: dictionary
    :return: list of (key, value) tuples
    """
    return sorted(dic.items(), key=lambda item: item[1], reverse=True)

if __name__ == '__main__':
    # Load the corpus
    print("\tDefault corpus path: D:/study/B4/data")
    print("\tSearch results path: D:/study/B4/result")
    path = "D:/study/B4/data"  # corpus path
    all_docu_dic = loadDataSet(path)  # load the corpus into memory
    # Returns 1. the vocabulary (stop words removed), 2. per-document word counts
    words_set, words_num_dic = dealDataSet(all_docu_dic)
    n = 0  # number of search rounds
    a = ''  # loop control; input() returns a string, so compare against '0'
    while a != '0':
        in_words = input("Search: ")
        # Note: inside [...], "+-=" is a range ('+' to '='), so digits also split
        input_list = re.split("[!? '. ),(+-=。:]", in_words)
        k = 0  # number of valid keywords in this input
        n += 1
        for word in input_list:
            if word in words_set:
                k += 1
                tfidf_dic = computeTFIDF(word, words_num_dic)  # unsorted tf-idf dict
                top5 = sortOut(tfidf_dic)[0:5]  # five most relevant documents
                # Console output
                print("Keyword: " + word)
                print(top5)
                # File output: append the ranked list to this round's file
                text_save("result" + str(n) + ".txt", top5, word)
        if k == 0:
            print("No search results")
        a = input("Enter anything to search again, or '0' to quit: ")
        print("-------------------------------------")