Python: How to use a GloVe word-embeddings file on Google Colaboratory

I downloaded the data with wget:

!wget http://nlp.stanford.edu/data/glove.6B.zip
- ‘glove.6B.zip’ saved [862182613/862182613]

It is saved as a zip archive, and I want to use the glove.6B.300d.txt file inside it. What I want to achieve is:

embeddings_index = {}
with io.open('glove.6B.300d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

Unsurprisingly, I get this error:

IOError                                   Traceback (most recent call last)
<ipython-input-47-d07cafc85c1c> in <module>()
      1 embeddings_index = {}
----> 2 with io.open('glove.6B.300d.txt', encoding='utf8') as f:
      3     for line in f:
      4         values = line.split()
      5         word = values[0]

IOError: [Errno 2] No such file or directory: 'glove.6B.300d.txt'

How can I unzip the archive on Google Colab and use that file in the code above?

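Another approach you can take is as follows.

1. Download the zip file

!wget http://nlp.stanford.edu/data/glove.6B.zip

After downloading, the zip file is saved in the /content directory on Google Colab.

2. Unzip it

!unzip glove*.zip

3. Get the exact path where the embedding vectors were extracted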

!ls
!pwd
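
The same check can also be done from Python rather than with shell commands. The following is a minimal sketch, assuming the archive was unzipped into the current working directory and that the extracted files match glove*.txt; `find_glove_files` is a hypothetical helper name, not part of any library.

```python
import glob
import os


def find_glove_files(directory="."):
    """Return absolute paths of extracted GloVe text files in `directory`.

    Hypothetical helper: the glove*.txt pattern assumes the file names
    used inside glove.6B.zip.
    """
    pattern = os.path.join(directory, "glove*.txt")
    return sorted(os.path.abspath(p) for p in glob.glob(pattern))


# List whatever GloVe files are present in the current directory.
for path in find_glove_files():
    print(path)
```

Any of the printed paths can be passed to open() in the next step.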

4. Index the vectors

import numpy as np

print('Indexing word vectors.')

# Map each word to its embedding vector.
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
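
A truncated or corrupted download can leave lines with the wrong number of values, so it may be worth checking the vector length while indexing. The sketch below is one way to do that, not the answer's own method: `parse_glove_lines` and `expected_dim` are hypothetical names, plain Python lists stand in for the NumPy arrays so that the example has no third-party dependencies, and the sample data is made up.

```python
import io


def parse_glove_lines(lines, expected_dim=None):
    """Parse GloVe-style text lines into a {word: [floats]} dict.

    Hypothetical helper: each line is assumed to be a token followed by
    whitespace-separated numbers, as in glove.6B.*.txt.
    """
    index = {}
    for line_no, line in enumerate(lines, start=1):
        values = line.split()
        if not values:
            continue  # skip blank lines
        word, coefs = values[0], [float(v) for v in values[1:]]
        if expected_dim is not None and len(coefs) != expected_dim:
            raise ValueError(
                "line %d: expected %d values, got %d"
                % (line_no, expected_dim, len(coefs))
            )
        index[word] = coefs
    return index


# Tiny in-memory stand-in for a GloVe file (made-up 3-dimensional vectors).
sample = io.StringIO(u"the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = parse_glove_lines(sample, expected_dim=3)
print("Found %s word vectors." % len(vectors))
```

With the real file, the same function could be called on the open file object with expected_dim set to the dimension in the file name, for example 100 for glove.6B.100d.txt.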

5. Mount Google Drive with FUSE

!pip install --upgrade pip
!pip install -U -q pydrive
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null

!apt-get -y install -qq google-drive-ocamlfuse fuse

from google.colab import auth
auth.authenticate_user()
# Generate creds for the Drive FUSE library.
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

!mkdir -p drive
!google-drive-ocamlfuse drive

6. Save the indexed vectors to Google Drive for reuse

import pickle
with open('drive/path/to/your/file/location', 'wb') as f:
    pickle.dump({'embeddings_index': embeddings_index}, f)
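
Saving the index is only useful if it can be read back in a later session. The following is a minimal round-trip sketch, assuming the same {'embeddings_index': ...} wrapper as in the step above; `save_index` and `load_index` are hypothetical helper names, and a temporary file stands in for the Drive path.

```python
import os
import pickle
import tempfile


def save_index(embeddings_index, path):
    """Pickle the index to `path`; the file is closed when the block exits."""
    with open(path, "wb") as f:
        pickle.dump({"embeddings_index": embeddings_index}, f)


def load_index(path):
    """Load an index previously written by save_index."""
    with open(path, "rb") as f:
        return pickle.load(f)["embeddings_index"]


# Round trip through a temporary file; on Colab the path would point
# somewhere under the mounted Drive folder instead.
demo_path = os.path.join(tempfile.mkdtemp(), "embeddings_index.pkl")
save_index({"cat": [0.4, 0.5, 0.6]}, demo_path)
restored = load_index(demo_path)
print(sorted(restored))
```

Only unpickle files you wrote yourself, since pickle.load can execute arbitrary code from an untrusted file.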

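If you have already downloaded the zip file to your local system, just extract it, upload the file with the dimension you need to Google Drive, mount Drive with FUSE, give the corresponding path, and then use it to build the index and so on.

Alternatively, if the file is already on your local system, you can upload it from code in Colab:

from google.colab import files
files.upload()

Select the file and continue from step 3 onward.

That is how you can use GloVe word embeddings in Google Colaboratory. Hope this helps.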

It's simple; check out this older post from SO.

import zipfile
zip_ref = zipfile.ZipFile(path_to_zip_file, 'r')
zip_ref.extractall(directory_to_extract_to)
zip_ref.close()
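
The same extraction can be written with a context manager, which closes the archive even if extraction fails. This is a sketch under assumptions: `extract_all` is a hypothetical wrapper name, and the demo builds a throwaway archive with made-up content so that it runs without glove.6B.zip.

```python
import os
import tempfile
import zipfile


def extract_all(path_to_zip_file, directory_to_extract_to):
    """Extract every member and return the member names in the archive."""
    with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
        zip_ref.extractall(directory_to_extract_to)
        return zip_ref.namelist()


# Demo on a throwaway archive; for the question the call would be
# extract_all("glove.6B.zip", ".").
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as z:
    z.writestr("a.txt", "hello")
    z.writestr("b.txt", "world")
names = extract_all(demo_zip, work_dir)
print(names)
```

The returned names show which files are available to open after extraction.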

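Related discussion

I want to do this on Google Colab. I don't think the GloVe zip was saved to my computer.

Assuming the zip file went into the current directory, as the wget output indicates, just specify glove.6B.zip as the path; I think that should work.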

File "<ipython-input-60-785ab10a0dbb>", line 2
    zip_ref = zipfile.ZipFile(glove.6B.zip, 'r')
                                     ^
SyntaxError: invalid syntax

This needs to be corrected to zipfile.ZipFile("glove.6B.zip", 'r'). Notice the quotation marks used to specify the file name.

Oh, thank you. And the directory_to_extract_to line extracts on my computer. Instead of extractall, how can I specify a single file?

Try zip_ref.extractall("."). Once it finishes, you can use the os.listdir function to check that all the files were extracted into the current directory.

Thank you very much for your guidance! I'm not getting any errors now. I'm accepting your answer as the correct one. Have a nice day!
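
On the follow-up question about extracting a single file instead of everything: `zipfile.ZipFile.extract` takes a member name and a target directory. The sketch below is an illustration rather than something from the thread; `extract_one` is a hypothetical wrapper name, and the demo archive with made-up content stands in for glove.6B.zip.

```python
import os
import tempfile
import zipfile


def extract_one(path_to_zip_file, member, directory_to_extract_to="."):
    """Extract a single member and return the path of the extracted file."""
    with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
        return zip_ref.extract(member, directory_to_extract_to)


# Demo archive standing in for glove.6B.zip; the real call would be
# extract_one("glove.6B.zip", "glove.6B.300d.txt").
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as z:
    z.writestr("glove.6B.100d.txt", "the 0.1\n")
    z.writestr("glove.6B.300d.txt", "the 0.2\n")
out_dir = os.path.join(work_dir, "out")
extracted = extract_one(demo_zip, "glove.6B.300d.txt", out_dir)
print(os.listdir(out_dir))
```

Extracting only the 300-dimensional file avoids writing the other dimension files to the Colab disk.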

If you have Google Drive, you can:


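1. Mount Google Drive so that it can be used in the Colab notebook

from google.colab import drive
drive.mount('/content/gdrive')

2. Download glove.6B.zip and extract it to a location of your choice on Google Drive, for example

"My Drive/Place/Of/Your/Choice/glove.6B.300d.txt"

3. Open the file directly from the Colab notebook

with io.open('/content/gdrive/My Drive/Place/Of/Your/Choice/glove.6B.300d.txt', encoding='utf8') as f:
    pass  # read the vectors here, as in the question's loop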