11.1 Download and Install the eclipse Scala IDE
Step 1. Browse to the Scala IDE website
http://scala-ide.org/
11.2 Install PyDev
Step 1. Launch eclipse
Enter the workspace path:
/home/hduser/pythonwork/
Step 3. Add the PyDev plugin
PyDev Location:
https://dl.bintray.com/fabioz/pydev/4.5.4/
11.3 Configure String Substitution Variables
Refer to the book for instructions on configuring the string substitution variables:
● SPARK_HOME (the Spark installation path)
/usr/local/spark
● HADOOP_CONF_DIR (the Hadoop configuration directory)
/usr/local/hadoop/etc/hadoop
● PYSPARK_PYTHON (the anaconda Python interpreter)
/home/hduser/anaconda2/bin/python
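As a quick sanity check (not one of the book's steps), you can run a couple of lines under the interpreter configured above to confirm PyDev is really using anaconda2's Python:

import sys

# Expect /home/hduser/anaconda2/bin/python and a 2.7.x version string
print(sys.executable)
print(sys.version)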
11.5 Configure the anaconda2 Library Path in PyDev
Refer to the book for instructions on adding the anaconda2 path:
/home/hduser/anaconda2/lib/python2.7/site-packages
11.6 Configure the Spark Python Libraries in PyDev
Refer to the book for instructions on configuring the Spark Python libraries.
The Spark Python library path:
/usr/local/spark/python/lib
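In effect, this PyDev setting puts Spark's Python libraries on the interpreter's module search path. A minimal sketch of the equivalent in plain Python (the py4j zip name below is taken from the tracebacks later on this page and varies by Spark version; check /usr/local/spark/python/lib for the exact file):

import sys

# Put Spark's Python API and its bundled py4j bridge on the search path
sys.path.insert(0, "/usr/local/spark/python/lib/pyspark.zip")
sys.path.insert(0, "/usr/local/spark/python/lib/py4j-0.10.3-src.zip")  # name varies by Spark version

from pyspark import SparkContext  # should now import cleanly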
11.7 Configure Environment Variables in PyDev
Refer to the book for instructions on configuring the environment variables:
SPARK_HOME
${SPARK_HOME}
HADOOP_CONF_DIR
${HADOOP_CONF_DIR}
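To confirm the variables actually reach a program launched with this run configuration (a quick check, not a book step):

import os

# Both should print the paths configured in section 11.3
print(os.environ.get("SPARK_HOME"))       # expect /usr/local/spark
print(os.environ.get("HADOOP_CONF_DIR"))  # expect /usr/local/hadoop/etc/hadoop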
11.10 Enter the WordCount.py Program
For WordCount.py, see Appendix A of the book ("Downloading and Installing the Sample Programs"), section A.3, "Opening the eclipse PythonProject sample programs."
# -*- coding: UTF-8 -*-
from pyspark import SparkContext
from pyspark import SparkConf

def SetLogger(sc):
    # Silence Spark's console logging down to ERROR level
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
    logger.LogManager.getRootLogger().setLevel(logger.Level.ERROR)

def SetPath(sc):
    # Use the local file system when running locally, HDFS on a cluster
    global Path
    if sc.master[0:5] == "local":
        Path = "file:/home/hduser/pythonwork/PythonProject/"
    else:
        Path = "hdfs://master:9000/user/hduser/"

def CreateSparkContext():
    sparkConf = SparkConf() \
        .setAppName("WordCounts") \
        .set("spark.ui.showConsoleProgress", "false")
    sc = SparkContext(conf=sparkConf)
    print("master=" + sc.master)
    SetLogger(sc)
    SetPath(sc)
    return sc

if __name__ == "__main__":
    print("開始執行RunWordCount")
    sc = CreateSparkContext()
    print("開始讀取文字檔...")
    textFile = sc.textFile(Path + "data/README.md")
    print("文字檔共" + str(textFile.count()) + "行")
    countsRDD = textFile \
        .flatMap(lambda line: line.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(lambda x, y: x + y)
    print("文字統計共" + str(countsRDD.count()) + "筆資料")
    print("開始儲存至文字檔...")
    try:
        countsRDD.saveAsTextFile(Path + "data/output")
    except Exception as e:
        print("輸出目錄已經存在,請先刪除原有目錄")
    sc.stop()

Regarding the code above: note that in Step 8 (saving the file) on page 266 of the book, the save path "output" is wrong:
countsRDD.saveAsTextFile(Path + "output")
The correct path is "data/output":
countsRDD.saveAsTextFile(Path + "data/output")
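Before saving, you can optionally preview the most frequent words. This snippet is a sketch that is not part of the book's listing, and it assumes the countsRDD from WordCount.py above:

# Pull the 10 highest-count (word, count) pairs back to the driver
for word, count in countsRDD.takeOrdered(10, key=lambda x: -x[1]):
    print("%s: %d" % (word, count))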
11.11 Create a Test File and Upload It to an HDFS Directory
Step 1. Copy the local test file
Run the following commands:
mkdir -p ~/pythonwork/PythonProject/data
cp /usr/local/spark/README.md ~/pythonwork/PythonProject/data
Step 3. Start the Hadoop cluster
start-all.sh
Step 4. Copy the test file to HDFS
Run the following commands:
hadoop fs -mkdir -p /user/hduser/data
hadoop fs -copyFromLocal /usr/local/spark/README.md /user/hduser/data/README.md
hadoop fs -ls /user/hduser/data/README.md
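As an optional check that the upload worked (not a book step), you can read the HDFS copy back from PySpark; this assumes an sc created as in WordCount.py and the hdfs://master:9000 namenode used in SetPath:

# The line count of the HDFS copy should match the local README.md
testFile = sc.textFile("hdfs://master:9000/user/hduser/data/README.md")
print(testFile.count())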
11.12 Run the WordCount Program with spark-submit
Step 2. Run WordCount in local mode
cd ~/pythonwork/PythonProject
spark-submit --driver-memory 2g --master local[4] WordCount.py
Step 3. Check the output file directory
ll data/output
Step 4. View the contents of the output files
cat data/output/part-00000 | more
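Each line of the part files is the string form of a (word, count) tuple, e.g. (u'Spark', 14). If you want to load the saved counts back into Python (a sketch, not from the book, assuming that default str(tuple) format):

import ast

# Parse lines like "(u'Spark', 14)" back into Python tuples
with open("data/output/part-00000") as f:
    counts = [ast.literal_eval(line) for line in f]
print(counts[:5])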
11.13 Run the WordCount Program on Hadoop YARN (yarn-client)
Step 2. Run WordCount on YARN in client mode
cd ~/pythonwork/PythonProject
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop spark-submit --driver-memory 512m --executor-cores 2 --master yarn --deploy-mode client WordCount.py
Step 3. View the directory created on HDFS after the run completes
hadoop fs -ls /user/hduser/data/output
Step 4. View the files created on HDFS after the run completes
hadoop fs -cat /user/hduser/data/output/part-00000 | more
Step 5. View WordCounts in the Hadoop web interface
Enter this in the address bar:
http://localhost:8088/
11.14 Run the WordCount Program on a Spark Standalone Cluster
Step 1. Delete the previously generated output directory
hadoop fs -rm -R /user/hduser/data/output
Step 2. Start the Standalone Cluster
/usr/local/spark/sbin/start-all.sh
Step 3. Run the WordCount program on the Spark Standalone Cluster
cd ~/pythonwork/PythonProject/
spark-submit --master spark://master:7077 --deploy-mode client --executor-memory 500M --total-executor-cores 2 WordCount.py
Step 4. Check the output directory after the program runs
hadoop fs -ls /user/hduser/data/output
Step 5. The Spark Standalone Web UI
http://master:8080/
Step 6. Delete the generated output directory
hadoop fs -rm -R /user/hduser/data/output
11.15 Run Python Spark Programs as an eclipse External Tool
Step 3. Configure the external tool
Name:
spark-submit
Location:
/usr/local/spark/bin/spark-submit
Working Directory:
${workspace_loc}/${project_name}
Arguments:
--driver-memory 2g --master local[4] ${resource_name} ${string_prompt}
(${resource_name} expands to the file selected in eclipse; ${string_prompt} opens a dialog for extra arguments at run time.)
11.16 Run spark-submit yarn-client from eclipse
Step 3. Configure the spark-submit yarn-client external tool
Name:
spark-submit yarn-client
Arguments:
--driver-memory 1024m --executor-cores 2 --executor-memory 1g --master yarn --deploy-mode client ${resource_name} ${string_prompt}
Note: the book prints "yarn--deploy-mode" without a space, which makes spark-submit read "client" as the application file (see the "Cannot load main class" comment below); the space is required.
11.17 Run spark-submit Standalone from eclipse
Step 3. Configure the spark-submit Standalone external tool
Name:
spark-submit Standalone
Arguments:
--master spark://master:7077 --deploy-mode client --executor-memory 500M --total-executor-cores 2 ${resource_name} ${string_prompt}
Note: the book prints "--totalexecutor-cores", which spark-submit rejects as an unrecognized option; the correct flag is --total-executor-cores.
This figure is from the official Spark website: https://spark.apache.org/
Comments
Step 3, the external tool setup: the Arguments are wrong; it should be --total-executor-cores.
Step 3, the spark-submit yarn-client external tool setup: the Arguments are wrong; there must be a space between yarn and --deploy-mode.
11.1 Step 4: how do you bring up the archive manager window?
Is there any way to download the file scala-SDK-4.1.0-vfinal-2.11-linux.gtk.x86_64.tar.gz? Is there a link for it?
https://dl.bintray.com/fabioz/pydev/4.5.4/ doesn't work
https://dl.binary.com/fabioz/pydev/4.5.4/
The eclipse installed on Ubuntu has no menu bar (toolbar); how do you fix that?
The problem above has been solved.
1. Running spark-submit yarn-client from eclipse gives an error message:
Error: Cannot load main class from JAR file:/home/hduser/pythonwork/PythonProject/client
Run with --help for usage help or --verbose for debug output
Hello, I ran into the same problem in Eclipse; have you solved it yet?
刪除二、在eclipse執行spark-submit spark-submit Standalone有錯誤訊息
回覆刪除Error: Unrecognized option: --totalexecutor-cores
I just unpacked eclipse and a dialog popped up saying "an error has occurred see the log file". Has anyone else hit this?
The official download site currently only offers version 4.7, which requires JDK 8 or later, but the earlier chapters install JDK 7. Can JDK 8 be installed alongside it, or is Scala IDE 4.1 still available for download somewhere? Thanks!
Every time spark-submit runs I get the error "ERROR - failed to write data to stream: ', mode 'w' at 0x7fc8d36bd150>". How should this be fixed? Thanks!
回覆刪除請問這問題解決了嗎?
刪除遇到一樣的問題...
「ERROR - failed to write data to stream: ', mode 'w' at 0x7fc8d36bd150>」
在local端可跑ㄝ,在yarn上不能跑
刪除启动eclipse时。弹出
回覆刪除An error has occurred. See the log file
/home/hduser/eclipse/configuration/1525866543026.log.
version 4.1.0: http://downloads.typesafe.com/scalaide-pack/4.1.0-vfinal-luna-211-20150525/scala-SDK-4.1.0-vfinal-2.11-linux.gtk.x86_64.tar.gz
If an error occurs and a log file is produced, open the log file and check whether it is a JDK version problem: 4.7.1 requires JDK 8, while the earlier chapters only install JDK 7.
How did you find the URL above?
刪除yarn--deploy-mode => yarn --deploy-mode 執行
回覆刪除在eclipse下執行 spark-submit yarn-client 出現下列錯誤:
回覆刪除開始執行RunWordCount
20/06/09 15:06:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/06/09 15:06:16 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
master=yarn
開始讀取文字檔...
20/06/09 15:06:47 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
File "/home/hduser/pythonwork/PythonProject/WordCount.py", line 38, in
print("文字檔共"+str(textFile.count())+"行")
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1008, in count
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 999, in sum
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 873, in fold
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 776, in collect
File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError
Has anyone dealt with this?
Thanks~
Running spark-submit Standalone under eclipse produces the following error:
開始執行RunWordCount
(The pasted output and traceback are identical to the yarn-client comment above.)
Has anyone dealt with this?
Thanks~
PyDev : http://pydev.sourceforge.net/pydev_update_site/4.5.4
The output directory already exists; delete the existing directory first:
回覆刪除hadoop fs -rm -R /user/hduser/data/output
11.14 Start the Standalone Cluster:
/usr/local/spark/sbin/start-all.sh
Watch out for connection problems.