8.1 Install Scala
Step 1~4. Download and install Scala
wget http://www.scala-lang.org/files/archive/scala-2.11.6.tgz
tar xvf scala-2.11.6.tgz
sudo mv scala-2.11.6 /usr/local/scala
Step 5. Set the Scala user environment variables
Edit ~/.bashrc:
sudo gedit ~/.bashrc
Enter the following:
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
Step 6. Make the changes to ~/.bashrc take effect
source ~/.bashrc
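To confirm that Scala is now on the PATH, you can check the version (a quick check; the exact output depends on your environment):
scala -version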
8.2 Install Spark
Step 1~3. Download and install Spark
wget http://apache.stu.edu.tw/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz
(If this mirror is no longer available, a reader comment at the end of this post suggests downloading from https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz instead.)
tar zxf spark-2.0.0-bin-hadoop2.6.tgz
sudo mv spark-2.0.0-bin-hadoop2.6 /usr/local/spark/
Step 4. Set the Spark user environment variables
Edit ~/.bashrc:
sudo gedit ~/.bashrc
Enter the following:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Step 5. Make the changes to ~/.bashrc take effect
source ~/.bashrc
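To confirm that the Spark commands are now on the PATH, you can check the version (a quick check; the banner shown will differ by environment):
spark-submit --version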
8.3 Start the Python Spark interactive shell
pyspark
8.4 Configure the messages pyspark displays
cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties
Edit log4j.properties:
sudo gedit log4j.properties
This opens log4j.properties in gedit; change the original INFO to WARN.
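In the Spark 2.0 template this should be roughly the following line; changing INFO to WARN keeps only warnings and errors in the pyspark console output:
log4j.rootCategory=WARN, console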
8.5 Create a test text file
If you already worked through Chapter 7, the test text file has been created and you can skip this step. If you have not done Chapter 7, follow the commands below.
Step 1. Copy LICENSE.txt
First, create a working directory:
mkdir -p ~/wordcount/input
Then copy the file:
cp /usr/local/hadoop/LICENSE.txt ~/wordcount/input
ll ~/wordcount/input
Step 3. Log in to the master virtual machine and start the Hadoop multi-node cluster
start-all.sh
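Before uploading to HDFS (and later when using YARN in Section 8.7), it is worth confirming that the Hadoop daemons are really running. A quick check, not from the book: run jps on master and you should see processes such as NameNode and ResourceManager.
jps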
Step 4. Upload the test file to an HDFS directory
hadoop fs -mkdir -p /user/hduser/wordcount/input
cd ~/wordcount/input
hadoop fs -copyFromLocal LICENSE.txt /user/hduser/wordcount/input
hadoop fs -ls /user/hduser/wordcount/input
8.6 Run pyspark programs locally
Step 1. Start pyspark
pyspark --master local[*]
Step 2. Check the current execution mode
sc.master
Step 3. Read a local file
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
Step 4. Read a file from HDFS
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
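Before leaving pyspark, you can also try a small word count on this RDD. The following is only a minimal sketch to illustrate a few basic RDD operations; it is not the full WordCount program from the book:
# split lines into words, pair each word with 1, then sum the counts per word
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)
# show a few (word, count) pairs
wordCounts.take(5)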
Step 5. Exit pyspark
exit()
8.7 Run pyspark on Hadoop YARN
Step 1. Run pyspark on Hadoop YARN. (The Hadoop cluster must already be running: start it on master with start-all.sh as in Section 8.5, otherwise pyspark hangs while waiting for YARN; see the reader comments at the end of this post.)
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
Step 2. Check the current execution mode
sc.master
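In YARN client mode this should return a value like the one below (the same output a reader posted in the comments at the end of this post):
u'yarn'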
Step 3. Read a file from HDFS
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
Step 4. View the PySparkShell app in the Hadoop web interface
http://localhost:8088/
Exit pyspark:
exit()
8.8 Build the Spark standalone cluster environment
Step 1. Copy spark-env.sh from the template file
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
Step 2. Configure spark-env.sh
sudo gedit /usr/local/spark/conf/spark-env.sh
Enter the following:
export SPARK_MASTER_IP=master
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=512m
export SPARK_EXECUTOR_INSTANCES=4
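For reference, the same settings again with a brief gloss on each variable (the comments are my own reading of these standard Spark environment variables, not from the book):
export SPARK_MASTER_IP=master            # hostname/IP the Spark master binds to
export SPARK_WORKER_CORES=1              # CPU cores each worker may use
export SPARK_WORKER_MEMORY=512m          # memory each worker may use
export SPARK_EXECUTOR_INSTANCES=4        # number of executor instances to start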
Step 3. Copy the Spark programs from master to data1
ssh data1
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data1:/usr/local
Step 4. Copy the Spark programs from master to data2
ssh data2
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data2:/usr/local
Step 5. Copy the Spark programs from master to data3
ssh data3
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data3:/usr/local
Step 6. Edit the slaves file
sudo gedit /usr/local/spark/conf/slaves
Enter the following:
data1
data2
data3
8.9 Run pyspark on the Spark standalone cluster
Step 1. Start the Spark standalone cluster
Start the master and slaves at the same time:
/usr/local/spark/sbin/start-all.sh
Or start the master and slaves separately:
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-slaves.sh
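After the cluster is started, a quick check (not from the book) is to run jps: the master node should show a Master process, and each data node should show a Worker process.
jps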
Step 2. Run pyspark on Spark standalone
pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
Step 3. Check the current execution mode
sc.master
Step 4. Read a local file
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
Step 5. Read a file from HDFS
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
8.10 The Spark web UI
Step 1. Open the Spark web UI
http://master:8080/
Step 4. Stop the Spark standalone cluster
/usr/local/spark/sbin/stop-all.sh
(Figure from the official Spark website: https://spark.apache.org/)
Reader comments:

Hello author,
pyspark runs fine on my local machine, but it will not run on YARN.
Here are some of the warnings I captured: WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243)
Could this be where the problem is? How should I fix it?
Thank you.
Hello author,
I ran into the same problem as Daniel. Here are the first three warnings for your reference; the first one is the same as in your book and is not the source of the problem.
17/12/13 18:20:24 WARN NativeCodeLoader: Unable to.... (omitted)
17/12/13 18:20:32 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/12/13 18:21:14 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
The resulting execution mode is as follows:
>>> sc.master
u'yarn'
Could you advise how to solve this? Thank you.
I have the same problem as the two readers above. After launching it, it was stuck for 15 minutes, so I interrupted it with Ctrl-C. The session output is below:
Please help, thank you~
hduser@master:~$ HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
18/03/29 10:00:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
^CTraceback (most recent call last):
File "/usr/local/spark/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/local/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/usr/local/spark/python/pyspark/context.py", line 294, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/usr/local/spark/python/pyspark/context.py", line 115, in __init__
conf, jsc, profiler_cls)
File "/usr/local/spark/python/pyspark/context.py", line 168, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/usr/local/spark/python/pyspark/context.py", line 233, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 1181, in __call__
File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 695, in send_command
File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 828, in send_command
File "/usr/lib/python2.7/socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
I found the problem: you need to start Hadoop first, by running start-all.sh in the master terminal.

The download link
wget http://apache.stu.edu.tw/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz
should be changed to
wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz
Put spark-2.0.0-bin-hadoop2.6.tgz in your home folder.

Question: why does /usr/local/spark/sbin/stop-all.sh need the full path in front of the command?