How do I enable Snappy codec support in a Spark cluster launched with Google Cloud Dataproc?
当尝试从使用Google Cloud Dataproc启动的Spark集群中读取Snappy压缩序列文件时,收到以下警告:
1 | java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support. |
在这种情况下启用Snappy编解码器支持的最佳方法是什么?
不幸的是,Dataproc的启动映像是在没有Snappy支持的情况下构建的。我已经打开了一个错误,以解决下一个图像问题。
解决方法:
首先创建一个小的Shell脚本,以正确安装snappy及其本机库支持。为此,我们将使用bdutil使用的相同本机库。我将脚本称为
1 2 3 4 5 6 | #!/bin/bash pushd"$(mktemp -d)" apt-get install -q -y libsnappy1 wget https://storage.googleapis.com/hadoop-native-dist/Hadoop_2.7.1-Linux-amd64-64.tar.gz tar zxvf Hadoop_2.7.1-Linux-amd64-64.tar.gz -C /usr/lib/hadoop/ |
将新的Shell脚本复制到您拥有的GCS存储桶中。出于演示目的,让我们假设存储桶为
1 | gsutil cp ./setup-snappy.sh gs://dataproc-actions/setup-snappy.sh |
启动集群时,请指定初始化操作:
1 | gcloud beta dataproc clusters create --initialization-actions gs://dataproc-actions/setup-snappy.sh mycluster |
我还没有自己做,但是这篇文章应该可以解决您的问题:
For installing and configuring other system-level components bdutil supports an extension mechanism. A good example of extensions is the Spark extension bundled with bdutil: extensions/spark/spark_env.sh. When running bdutil extensions are added with the -e flag e.g., to deploy Spark with Hadoop:
./bdutil -e extensions/spark/spark_env.sh deploy