Chapter 12: Building a Recommendation Engine with Python Spark


12.4 How to Collect the Data?
Open the MovieLens website at:
http://grouplens.org/datasets/movielens/
Step1. Download the ml-100k dataset
mkdir -p ~/pythonwork/PythonProject/data
cd ~/pythonwork/PythonProject/data
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
Step2. Extract the ml-100k dataset
unzip -j ml-100k.zip

Step5. Start the Hadoop cluster
start-all.sh
Step6. Copy the ml-100k files to HDFS
hadoop fs -mkdir /user/hduser/data
hadoop fs -copyFromLocal -f ~/pythonwork/PythonProject/data /user/hduser/
hadoop fs -ls /user/hduser/data
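After the copy, the ratings file sits in HDFS under /user/hduser/data. Each line of the extracted u.data file is one rating in tab-separated form: userID, movieID, rating, timestamp. A minimal pure-Python sketch of parsing that format (the sample records below are for illustration only):

```python
# Parse MovieLens u.data lines: userID \t movieID \t rating \t timestamp.
# After unzipping, the real file is at ~/pythonwork/PythonProject/data/u.data.

def parse_rating(line):
    """Split one u.data line into (user_id, movie_id, rating)."""
    fields = line.split("\t")
    return int(fields[0]), int(fields[1]), float(fields[2])

sample = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
]
ratings = [parse_rating(line) for line in sample]
print(ratings)  # [(196, 242, 3.0), (186, 302, 3.0)]
```

The same split-and-convert step is what a Spark job would apply per line after loading the file as an RDD of strings.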
12.5 Starting IPython Notebook
Step1. Open a terminal, start the cluster, and switch to the working directory
start-all.sh
cd ~/pythonwork/ipynotebook
Step2. Run IPython Notebook in Hadoop yarn-client mode
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client

See Appendix A, "Downloading and Installing This Book's Example Programs," Section A.2, for how to open the book's IPython Notebook example file ch12.ipynb.


12.10 Building the Recommend Recommendation System
See Appendix A, "Downloading and Installing This Book's Example Programs," Section A.3, for how to open the Eclipse PythonProject example programs: RecommendTrain.py and Recommend.py
(Figure from the official Spark website: https://spark.apache.org/)


The content above is excerpted from this book, which is well suited for Python programmers learning Spark machine learning and big data architecture. Click the link below for a detailed introduction:
  Python+Spark 2.0+Hadoop機器學習與大數據分析實戰
  http://pythonsparkhadoop.blogspot.tw/2016/10/pythonspark-20hadoop.html

《Buy This Book: Limited-Time Special Price》
Books.com.tw: http://www.books.com.tw/products/0010730134?loc=P_007_090

Tenlong Bookstore: https://www.tenlong.com.tw/items/9864341537?item_id=1023658

Ruten: http://goods.ruten.com.tw/item/show?21640846068139
Shopee: https://goo.gl/IEx13P



About kevin

4 comments:

  1. Hello teacher:

    When running RecommendTrain.py from Eclipse, the following error occurred. Could you give me some pointers?? Thank you!
    17/09/26 10:00:50 ERROR SparkContext: Error initializing SparkContext.
    java.net.ConnectException: Call From master/192.168.56.100 to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor7.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy9.getNewApplication(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:217)
    at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy10.getNewApplication(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNewApplication(YarnClientImpl.java:206)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createApplication(YarnClientImpl.java:214)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:157)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:748)

  2. Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    ... 28 more
    17/09/26 10:00:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
    17/09/26 10:00:51 WARN MetricsSystem: Stopping a MetricsSystem that is not running
    Traceback (most recent call last):
    File "/home/hduser/pythonwork/PythonProject/RecommendTrain.py", line 53, in <module>
    sc=CreateSparkContext()
    File "/home/hduser/pythonwork/PythonProject/RecommendTrain.py", line 23, in CreateSparkContext
    sc = SparkContext(conf = sparkConf)
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 168, in _do_init
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 233, in _initialize_context
    File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 1183, in __call__
    File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
    py4j.protocol.Py4JJavaError

  3. Hello:
    I ran into the following problem when running Recommend.py (using the book's example)
    master=local[4]
    ========== Data preparation ==========
    Reading the movie ID and title lookup table...
    ========== Loading model ==========
    ALSModel not found; please train it first
    Traceback (most recent call last):
    File "/home/hduser/pythonwork/PythonProject/Recommend.py", line 79, in <module>
    model=loadModel(sc)
    File "/home/hduser/pythonwork/PythonProject/Recommend.py", line 59, in loadModel
    return model
    UnboundLocalError: local variable 'model' referenced before assignment
    Searching for similar error cases suggests that the local variable model is referenced before assignment. How should this be resolved? Thanks

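Regarding the UnboundLocalError above: it typically means that model in loadModel is bound only inside a try block, so when loading fails the except branch prints its message and execution still reaches return model, which hits an unbound name. A minimal standalone sketch of the failure pattern and one possible fix (the function names and the loader argument are illustrative, not the book's exact code):

```python
def load_model_buggy(loader):
    # Mirrors the reported failure: 'model' is bound only if loader()
    # succeeds, so after a failed load, 'return model' raises
    # UnboundLocalError: local variable 'model' referenced before assignment.
    try:
        model = loader()
        print("model loaded")
    except Exception:
        print("ALSModel not found; please train it first")
    return model

def load_model_fixed(loader):
    # One possible fix: return from inside the try, and return None
    # (or exit) when loading fails, so no unbound name is ever reached.
    try:
        return loader()
    except Exception:
        print("ALSModel not found; run RecommendTrain.py first")
        return None

def missing_model():
    # Stand-in for a load that fails because the model files are absent.
    raise IOError("model files not found")

try:
    load_model_buggy(missing_model)
except UnboundLocalError as e:
    print("buggy version raised:", type(e).__name__)

result = load_model_fixed(missing_model)
print(result)  # None
```

In the book's context the root cause is usually that the ALS model was never trained and saved (or the HDFS path is wrong), so running RecommendTrain.py first is the practical remedy; the fix above just makes the failure message clean instead of crashing with an unrelated error.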
  4. If an ALSmodel has already been saved, it must be deleted first
    hadoop fs -rm -R /user/hduser/ALSmodel/data
    hadoop fs -rm -R /user/hduser/ALSmodel/metadata
