Chapter 11: Introducing the Python Spark Integrated Development Environment


11.1 Download and install the Eclipse Scala IDE
Step 1. Browse the Scala IDE website
http://scala-ide.org/
11.2 Install PyDev
Step 1. Launch Eclipse
Enter the workspace path:
/home/hduser/pythonwork/
Step 3. Add the plug-in
PyDev Location:
https://dl.bintray.com/fabioz/pydev/4.5.4/
11.3 Set string substitution variables
Refer to the book for the steps to set the string substitution variables:
● SPARK_HOME (the Spark installation path)
/usr/local/spark
● HADOOP_CONF_DIR (the Hadoop configuration directory)
/usr/local/hadoop/etc/hadoop
● PYSPARK_PYTHON (the Anaconda Python interpreter path)
/home/hduser/anaconda2/bin/python
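Before moving on, you can sanity-check that the three paths above actually exist on the machine. A minimal sketch (plain Python, run from any terminal; the dict here just restates the values configured in 11.3):

import os

# The three paths configured as string substitution variables in 11.3.
paths = {
    "SPARK_HOME": "/usr/local/spark",
    "HADOOP_CONF_DIR": "/usr/local/hadoop/etc/hadoop",
    "PYSPARK_PYTHON": "/home/hduser/anaconda2/bin/python",
}
for name, path in paths.items():
    print(name, path, "OK" if os.path.exists(path) else "MISSING")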
11.5 Set the anaconda2 library path in PyDev
Refer to the book for the steps to add the anaconda2 path:
/home/hduser/anaconda2/lib/python2.7/site-packages
11.6 Set the Spark Python libraries in PyDev
Refer to the book for the steps to set the Spark Python libraries.
Spark's Python library path:
/usr/local/spark/python/lib
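What this step gives PyDev is the pyspark and py4j zip archives that live under that directory. To verify the same thing outside Eclipse, a hedged sketch (the exact zip file names vary by Spark version, so it globs for all of them):

import sys, glob

# Put every zip under Spark's Python lib directory (pyspark.zip and
# py4j-*-src.zip) on the module search path, then try the import.
sys.path.extend(glob.glob("/usr/local/spark/python/lib/*.zip"))
from pyspark import SparkConf, SparkContext
print("pyspark import OK")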
11.7 Set environment variables in PyDev
Refer to the book for the steps to set the environment variables:
SPARK_HOME
${SPARK_HOME}
HADOOP_CONF_DIR
${HADOOP_CONF_DIR}
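To confirm the variables are actually visible to programs launched from PyDev, a minimal check run as a PyDev script (None means the environment-variable setting did not take effect):

import os

# Both should print the paths configured in 11.3.
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
print("HADOOP_CONF_DIR =", os.environ.get("HADOOP_CONF_DIR"))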
11.10 Enter the WordCount.py program
For WordCount.py, see Appendix A of the book, "Downloading and Installing the Book's Sample Programs", section A.3, "Opening the eclipse PythonProject sample programs".
# -*- coding: UTF-8 -*-
from pyspark import SparkContext
from pyspark import SparkConf

def SetLogger(sc):
    # Raise the log4j levels through the JVM gateway to suppress INFO/WARN noise.
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
    logger.LogManager.getRootLogger().setLevel(logger.Level.ERROR)

def SetPath(sc):
    # Choose the data path according to the master: the local file system when
    # running in local mode, HDFS otherwise.
    global Path
    if sc.master[0:5] == "local":
        Path = "file:/home/hduser/pythonwork/PythonProject/"
    else:
        Path = "hdfs://master:9000/user/hduser/"


def CreateSparkContext():
    # Build the configuration, create the SparkContext, and apply the helpers above.
    sparkConf = SparkConf()                               \
        .setAppName("WordCounts")                         \
        .set("spark.ui.showConsoleProgress", "false")
    sc = SparkContext(conf=sparkConf)
    print("master=" + sc.master)
    SetLogger(sc)
    SetPath(sc)
    return sc

    

if __name__ == "__main__":
    print("Starting RunWordCount")
    sc = CreateSparkContext()

    print("Reading the text file...")
    textFile = sc.textFile(Path + "data/README.md")
    print("The text file has " + str(textFile.count()) + " lines")

    # Split each line into words, map each word to (word, 1), then sum per word.
    countsRDD = textFile                               \
        .flatMap(lambda line: line.split(' '))         \
        .map(lambda x: (x, 1))                         \
        .reduceByKey(lambda x, y: x + y)

    print("The word count produced " + str(countsRDD.count()) + " entries")
    print("Saving to a text file...")
    try:
        countsRDD.saveAsTextFile(Path + "data/output")
    except Exception as e:
        print("The output directory already exists; please delete it first")
    sc.stop()
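
To eyeball the results without opening the output files, the most frequent words can be printed before the sc.stop() call. A minimal sketch (takeOrdered is a standard RDD action; the cutoff of ten is my choice):

    # Show the ten most frequent words, highest count first
    # (place this before sc.stop(), while the SparkContext is still alive).
    for word, count in countsRDD.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, count)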

For the code above, note Step 8 (saving the file) on page 266 of the book:

The save path "output" given there is wrong:
countsRDD.saveAsTextFile(Path+ "output")
The correct path is "data/output":
countsRDD.saveAsTextFile(Path+ "data/output")
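Rerunning the program fails when data/output already exists, which is what the try/except above catches. One way to handle it from Python, sketched through the same JVM gateway that SetLogger uses (the helper name is mine; FileSystem.get, exists, and delete are standard Hadoop FileSystem API calls):

def DeleteIfExists(sc, path):
    # Remove an existing output directory so repeated runs do not abort.
    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    p = hadoop.fs.Path(path)
    if fs.exists(p):
        fs.delete(p, True)   # True = delete recursively

# Usage, just before saving:
# DeleteIfExists(sc, Path + "data/output")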
11.11 Create the test file and upload it to HDFS
Step 1. Copy the local test file
Run the following commands:
mkdir -p ~/pythonwork/PythonProject/data
cp /usr/local/spark/README.md ~/pythonwork/PythonProject/data
Step 3. Start the Hadoop cluster
start-all.sh
Step 4. Copy the test file to HDFS
Run the following commands:
hadoop fs -mkdir -p /user/hduser/data
hadoop fs -copyFromLocal /usr/local/spark/README.md /user/hduser/data/README.md
hadoop fs -ls /user/hduser/data/README.md
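As a cross-check that the uploaded file is readable from Spark, a small sketch (it assumes the NameNode address master:9000 used in SetPath, and a pyspark shell or any program that already has a SparkContext sc):

# Count the lines of the uploaded file directly from HDFS.
print(sc.textFile("hdfs://master:9000/user/hduser/data/README.md").count())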
11.12 Run the WordCount program with spark-submit
Step 2. Run WordCount in local mode
cd ~/pythonwork/PythonProject

spark-submit --driver-memory 2g --master local[4] WordCount.py

Step 3. List the output directory
ll data/output
Step 4. View the output file
cat data/output/part-00000 | more
11.13 Run the WordCount program on Hadoop yarn-client
Step 2. Run WordCount on YARN in client mode
cd ~/pythonwork/PythonProject

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop spark-submit --driver-memory 512m --executor-cores 2 --master yarn --deploy-mode client WordCount.py

Step 3. List the directory produced in HDFS after the run
hadoop fs -ls /user/hduser/data/output
Step 4. View the file produced in HDFS after the run
hadoop fs -cat /user/hduser/data/output/part-00000 | more
Step 5. View the WordCounts job in the Hadoop web UI
Enter this URL in the address bar:
http://localhost:8088/
11.14 Run the WordCount program on a Spark Standalone Cluster
Step 1. Delete the previously produced directory
hadoop fs -rm -R /user/hduser/data/output
Step 2. Start the Standalone Cluster
/usr/local/spark/sbin/start-all.sh
Step 3. Run the WordCount program on the Spark Standalone Cluster
cd ~/pythonwork/PythonProject/

spark-submit --master spark://master:7077 --deploy-mode client --executor-memory 500M --total-executor-cores 2 WordCount.py
Step 4. List the output directory after the run
hadoop fs -ls /user/hduser/data/output
Step 5. The Spark Standalone Web UI
http://master:8080/
Step 6. Delete the produced directory
hadoop fs -rm -R /user/hduser/data/output
11.15 Run the Python Spark program from Eclipse external tools
Step 3. Configure the external tool
Name:
spark-submit
Location:
/usr/local/spark/bin/spark-submit
Working Directory:
${workspace_loc}/${project_name}
Arguments:
--driver-memory 2g --master local[4] ${resource_name} ${string_prompt}
11.16 Run spark-submit yarn-client from Eclipse
Step 3. Configure the spark-submit yarn-client external tool
Name:
spark-submit yarn-client
Arguments:
--driver-memory 1024m --executor-cores 2 --executor-memory 1g --master yarn --deploy-mode client ${resource_name} ${string_prompt}
11.17 Run spark-submit Standalone from Eclipse
Step 3. Configure the spark-submit Standalone external tool
Name:
spark-submit Standalone
Arguments:
--master spark://master:7077 --deploy-mode client --executor-memory 500M --total-executor-cores 2 ${resource_name} ${string_prompt}

The figure above is from the official Spark website: https://spark.apache.org/
29 comments:

  1. Step 3, configuring the spark-submit yarn-client external tool: the Arguments are wrong; it should be --total-executor-cores.
  2. Step 3, configuring the spark-submit yarn-client external tool: the Arguments are wrong; there must be a space inside yarn--deploy-mode.
  3. 11.1 Step 4: how do you open the archive manager window?
  4. Is there any way to download the file scala-SDK-4.1.0-vfinal-2.11-linux.gtk.x86_64.tar.gz? Is there a download link?
     Replies:
     1. https://dl.bintray.com/fabioz/pydev/4.5.4/ doesn't work
     2. https://dl.binary.com/fabioz/pydev/4.5.4/
  5. Eclipse installed under Ubuntu has no menu bar (toolbar); how can this be fixed?
  6. This comment has been removed by the author.
  7. (1) Running spark-submit yarn-client from Eclipse gives an error:
     Error: Cannot load main class from JAR file:/home/hduser/pythonwork/PythonProject/client
     Run with --help for usage help or --verbose for debug output
     Reply:
     1. Hi, I hit the same problem in Eclipse; have you solved it yet?
  8. (2) Running spark-submit Standalone from Eclipse gives an error:
     Error: Unrecognized option: --totalexecutor-cores
  9. I just finished unpacking Eclipse; has anyone else had an "an error has occurred see the log file" dialog pop up...?
  10. The official download site currently only offers version 4.7, which requires JDK 8 or later, but the earlier installation steps all use JDK 7. Can JDK 8 be installed alongside, or is Scala IDE 4.1 still available for download somewhere? Thanks!
  11. Every spark-submit run shows the error message "ERROR - failed to write data to stream: ', mode 'w' at 0x7fc8d36bd150>". How can this be fixed? Thanks!
     Replies:
     1. Has this been solved? I'm hitting the same problem...
        "ERROR - failed to write data to stream: ', mode 'w' at 0x7fc8d36bd150>"
     2. It runs on local, but not on YARN.
  12. When starting Eclipse, a dialog pops up:
      An error has occurred. See the log file
      /home/hduser/eclipse/configuration/1525866543026.log.
  13. version 4.1.0: http://downloads.typesafe.com/scalaide-pack/4.1.0-vfinal-luna-211-20150525/scala-SDK-4.1.0-vfinal-2.11-linux.gtk.x86_64.tar.gz

      If an error occurs and produces a log file, open the log file and check whether it is a JDK version problem: 4.7.1 requires JDK 8, while the earlier steps only install JDK 7.
  14. yarn--deploy-mode => yarn --deploy-mode, then it runs.
  15. This comment has been removed by the author.
  16. Running spark-submit yarn-client from Eclipse produces the following error:

    Starting RunWordCount
    20/06/09 15:06:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    20/06/09 15:06:16 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    master=yarn
    Reading the text file...
    20/06/09 15:06:47 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
    Traceback (most recent call last):
    File "/home/hduser/pythonwork/PythonProject/WordCount.py", line 38, in
    print("The text file has " + str(textFile.count()) + " lines")
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1008, in count
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 999, in sum
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 873, in fold
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 776, in collect
    File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
    File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
    py4j.protocol.Py4JJavaError

    Has anyone dealt with this?
    Thanks!
  17. Running spark-submit Standalone from Eclipse produces the following error:

    Starting RunWordCount
    20/06/09 15:06:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    20/06/09 15:06:16 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    master=yarn
    Reading the text file...
    20/06/09 15:06:47 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
    Traceback (most recent call last):
    File "/home/hduser/pythonwork/PythonProject/WordCount.py", line 38, in
    print("The text file has " + str(textFile.count()) + " lines")
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1008, in count
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 999, in sum
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 873, in fold
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 776, in collect
    File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
    File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
    py4j.protocol.Py4JJavaError

    Has anyone dealt with this?
    Thanks!
  18. PyDev: http://pydev.sourceforge.net/pydev_update_site/4.5.4
  19. The output directory already exists; delete it first:
      hadoop fs -rm -R /user/hduser/data/output
  20. 11.14 Start the Standalone Cluster:
      /usr/local/spark/sbin/start-all.sh