Python: Auto-correct

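I have two files, check.txt and orig.txt. I want to check every word in check.txt and see whether it matches any word in orig.txt. If it does match, the code should replace that word with its first match; otherwise it should leave the word as it is. But somehow it is not working as required. Please help.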

check.txt looks like this:

ukrain

troop

force

And orig.txt looks like this:

ukraine cnn should stop pretending & announce: we will not report news while it reflects bad on obama @bostonglobe @crowleycnn @hardball

rt @cbcnews: breaking: .@vice journalist @simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou

russia 'outraged' at deadly shootout in east #ukraine -  moscow:... http://t.co/nqim7uk7zg
 #groundtroops #russianpresidentvladimirputin

http://pastebin.com/xjedhy3g

f = open('check.txt','r')
orig = open('orig.txt','r')
new = open('newfile.txt','w')

for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()            
            if word in word2:
                word = word2
            else:
                print('not found')
        new.write(word)


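Your code has two problems:

  • When you loop over the words in f, each word still ends with a newline character, so your in check does not work.
  • You want to iterate over orig for every word from f, but files are iterators, so orig is exhausted after the first word from f.

You can fix these by doing word = word.strip() and orig = list(orig), or you can try something like this: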

    # get all stemmed words, skipping blank lines ('' would match every word)
    stemmed = [line.strip() for line in f if line.strip()]
    # set of lowercased original words
    original = set(word.lower() for line in orig for word in line.split())
    # map stemmed words to unstemmed words
    unstemmed = {word: None for word in stemmed}
    # find original words for word stems in map
    for stem in unstemmed:
        for word in original:
            if stem in word:
                unstemmed[stem] = word
    print(unstemmed)

Or shorter (without the final double loop), using difflib, as suggested in the comments:

    import difflib

    unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
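Note that difflib.get_close_matches returns a list, which is empty when nothing scores at or above its cutoff (0.6 by default), so the one-liner maps each stem to a list rather than to a word. It also ranks by similarity ratio rather than by substring containment, so a long word such as "#groundtroops" may fall below the cutoff for "troop". A minimal sketch of unwrapping the result, assuming small hand-written sample data in place of the two files:

```python
import difflib

# Hypothetical sample data standing in for the contents of the two files.
stemmed = ["ukrain", "troop", "force"]
original = {"ukraine", "cnn", "troops", "news"}

unstemmed = {}
for stem in stemmed:
    # Ask for at most one candidate; the result is a list that may be empty.
    matches = difflib.get_close_matches(stem, original, n=1)
    # Fall back to the stem itself when nothing is close enough.
    unstemmed[stem] = matches[0] if matches else stem

print(unstemmed)
```

If the matching turns out too strict or too loose, the cutoff argument (a float between 0 and 1) is the knob to adjust.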

Also, remember to close your files, or use the with keyword to close them automatically.
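Putting the pieces together, here is a minimal sketch of the original loop with both fixes applied and the files managed by with. The function name autocorrect_files is hypothetical, and "first match" is assumed to mean the first word, in file order, that contains the stem:

```python
def autocorrect_files(check_path, orig_path, out_path):
    """Write each word from check_path, replaced by its first match if any."""
    # Each file is closed when its with-block exits, even if an error occurs.
    with open(orig_path) as orig:
        # Read orig once into a list so it can be scanned again for every stem.
        # Words are lowercased and kept in file order.
        original = [w.lower() for line in orig for w in line.split()]

    with open(check_path) as check, open(out_path, "w") as out:
        for line in check:
            stem = line.strip()  # drop the trailing newline
            if not stem:
                continue  # skip blank lines; '' is a substring of every word
            # First original word containing the stem, else the stem itself.
            match = next((w for w in original if stem in w), stem)
            out.write(match + "\n")
```

With the file names from the question, this would be called as autocorrect_files('check.txt', 'orig.txt', 'newfile.txt').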