Chapter 8: Introduction to and Installation of Python Spark 2.0


8.1 Installing Scala
Step 1~4: Download and install Scala
wget http://www.scala-lang.org/files/archive/scala-2.11.6.tgz
tar xvf scala-2.11.6.tgz
sudo mv scala-2.11.6 /usr/local/scala
Step 5: Set the Scala user environment variables
Edit ~/.bashrc:
sudo gedit ~/.bashrc
Add the following lines:
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
Step 6: Apply the changes to ~/.bashrc
source ~/.bashrc
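To confirm that the setting took effect, you can print the Scala version (an optional sanity check, not part of the original steps):
scala -version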
8.2 Installing Spark
Step 1~3: Download and install Spark
(Note: if the apache.stu.edu.tw mirror below no longer hosts Spark 2.0.0, the Apache archive URL mentioned in the reader comments also works: https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz)
wget http://apache.stu.edu.tw/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz
tar zxf spark-2.0.0-bin-hadoop2.6.tgz
sudo mv spark-2.0.0-bin-hadoop2.6 /usr/local/spark
Step 4: Set the Spark user environment variables
Edit ~/.bashrc:
sudo gedit ~/.bashrc
Add the following lines:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Step 5: Apply the changes to ~/.bashrc
source ~/.bashrc
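Similarly, you can confirm that the Spark binaries are now on the PATH before moving on (an optional check, not part of the original steps):
spark-submit --version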

8.3 Starting the Python Spark interactive shell

pyspark
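After the shell starts, a SparkContext is already available as sc. As a quick optional check, you can print the Spark version:
sc.version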
8.4 Configuring pyspark log messages

cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties 
Edit log4j.properties:
sudo gedit log4j.properties
Open log4j.properties in gedit and change the log level from INFO to WARN.
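Concretely, in the Spark 2.0 template the relevant setting is log4j.rootCategory; change the line
log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console
so that only warnings and errors are shown in the pyspark shell.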
8.5 Creating a test text file
If you already went through Chapter 7, the test text file has been created and you can skip this section. If you have not gone through Chapter 7, run the following commands.
Step 1: Copy LICENSE.txt
First create the working directory:
mkdir -p ~/wordcount/input
Then copy the file:
cp /usr/local/hadoop/LICENSE.txt ~/wordcount/input
ll ~/wordcount/input
Step 2: Log in to the master virtual machine and start the Hadoop multi-node cluster
start-all.sh
Step 3: Upload the test file to the HDFS directory
hadoop fs -mkdir -p /user/hduser/wordcount/input
cd ~/wordcount/input
hadoop fs -copyFromLocal LICENSE.txt /user/hduser/wordcount/input
hadoop fs -ls /user/hduser/wordcount/input
8.6 Running pyspark locally
Step 1: Start pyspark
pyspark --master local[*]
Step 2: Check the current execution mode
sc.master
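Because pyspark was started with --master local[*], sc.master should display u'local[*]' here.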
Step 3: Read a local file
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
Step 4: Read an HDFS file
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
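If you want to experiment a bit more with the same RDD before leaving the shell (an optional sketch, assuming textFile still refers to the LICENSE.txt loaded in Step 4), try:
textFile.first()
textFile.filter(lambda line: "apache" in line).count()
first() returns the first line of the file, and the filter/count pair counts the lines that contain the word "apache".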
Step 5: Exit pyspark
exit()
8.7 Running pyspark on Hadoop YARN
Step 1: Launch pyspark on Hadoop YARN (the Hadoop cluster must already be running; see start-all.sh in section 8.5 and the reader comments below)
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
Step 2: Check the current execution mode
sc.master
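In YARN client mode this should display u'yarn' (the same value appears in the reader comments below).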
Step 3: Read an HDFS file
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
Step 4: View the PySparkShell application in the Hadoop web UI
http://localhost:8088/
Step 5: Exit pyspark
exit()
8.8 Building the Spark standalone cluster environment
Step 1: Copy spark-env.sh from the template file
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
Step 2: Configure spark-env.sh
sudo gedit /usr/local/spark/conf/spark-env.sh
Add the following lines (master host, cores per worker, memory per worker, and number of executor instances):
export SPARK_MASTER_IP=master
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=512m
export SPARK_EXECUTOR_INSTANCES=4
Step 3: Copy the Spark installation from master to data1
ssh data1
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data1:/usr/local
Step 4: Copy the Spark installation from master to data2
ssh data2
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data2:/usr/local
Step 5: Copy the Spark installation from master to data3
ssh data3
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data3:/usr/local
Step 6: Edit the slaves file
sudo gedit /usr/local/spark/conf/slaves
Add the following lines:
data1
data2
data3
8.9 Running pyspark on the Spark standalone cluster
Step 1: Start the Spark standalone cluster
To start the master and the slaves together:
/usr/local/spark/sbin/start-all.sh
Or start the master and the slaves separately:
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-slaves.sh
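To verify that the daemons are up (an optional check, assuming a standard JDK installation), run jps on master and on each data node; master should list a Master process and data1~data3 should each list a Worker process:
jps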
Step 2: Run pyspark on the Spark standalone cluster
pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
Step 3: Check the current execution mode
sc.master
Step 4: Read a local file
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
Step 5: Read an HDFS file
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
8.10 The Spark Web UI
Step 1: Open the Spark Web UI
http://master:8080/
Step 2: Stop the Spark standalone cluster
/usr/local/spark/sbin/stop-all.sh

[Figure: the Spark Web UI; image from the official Spark website, https://spark.apache.org/]

7 comments:

  1. Hello author,
    pyspark runs fine in local mode, but it will not run on YARN.
    Here is one of the warnings: WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243)
    Could this be where the problem lies? How should I fix it?
    Thank you.

  2. Hello author,
    I have the same problem as Daniel. Here are the first three messages for your reference; the first one is the same as in your book, so it is not the issue.
    17/12/13 18:20:24 WARN NativeCodeLoader: Unable to....(omitted)
    17/12/13 18:20:32 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    17/12/13 18:21:14 ERROR SparkContext: Error initializing SparkContext.
    org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

    The resulting execution mode is:
    >>> sc.master
    u'yarn'

    Do you know how to solve this? Thank you.

  3. I have the same problem as the two readers above. The command hung for 15 minutes before I interrupted it with Ctrl-C. The output is below.
    Please help, thanks~
    hduser@master:~$ HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
    Python 2.7.6 (default, Jun 22 2015, 17:58:13)
    [GCC 4.8.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    18/03/29 10:00:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    ^CTraceback (most recent call last):
    File "/usr/local/spark/python/pyspark/shell.py", line 43, in
    spark = SparkSession.builder\
    File "/usr/local/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
    File "/usr/local/spark/python/pyspark/context.py", line 294, in getOrCreate
    SparkContext(conf=conf or SparkConf())
    File "/usr/local/spark/python/pyspark/context.py", line 115, in __init__
    conf, jsc, profiler_cls)
    File "/usr/local/spark/python/pyspark/context.py", line 168, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
    File "/usr/local/spark/python/pyspark/context.py", line 233, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
    File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 1181, in __call__
    File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 695, in send_command
    File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 828, in send_command
    File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
    KeyboardInterrupt

  4. I found the problem: you need to start Hadoop first by running start-all in the master terminal.

  5. wget http://apache.stu.edu.tw/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz ->
    wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz

  6. Place spark-2.0.0-bin-hadoop2.6.tgz in the home folder.

  7. Question: for /usr/local/spark/sbin/stop-all.sh, why does the path have to be added in front?
