機器人實驗場 (Robot Lab): Selenium Browser Automation

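Selenium is a library for automating browser operations, most commonly used for automated website testing. Because more and more websites are built with JavaScript, implementing a crawler directly on top of HTTP methods has become more complicated to develop, and Selenium has become an alternative for writing crawlers. Selenium works by launching a browser, opening a web page through it, and then triggering events on specific elements in the page. In principle, any element that can be displayed in the page can be targeted, and the triggered event can be a mouse click or keyboard input. Essentially, you can think of it as a way to simulate how a real user operates a browser.

Today's demonstration uses python selenium on Linux to launch firefox 31 (firefox v31 is the version tested so far that works correctly; if the browser misbehaves, consult the browser support list and choose a compatible browser), open the Google home page, and trigger keyboard input and mouse click events to simulate a user's actions.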

First, set up a python virtualenv test environment:
1. Set the virtualenv path (assuming ~/.python/virt_env/selenium/ as the virtualenv path)

mkdir -p ~/.python/virt_env/

virtualenv ~/.python/virt_env/selenium/

2. Switch to the test environment

source ~/.python/virt_env/selenium/bin/activate

3. Install yolk

easy_install yolk

4. Install selenium; after installation, yolk can be used to check the installed packages

easy_install selenium

yolk -l
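As a complement to yolk -l, the installation can also be checked from inside Python. The snippet below is a minimal sketch, assuming a Python 3 interpreter inside the activated virtualenv; the helper name is_installed is made up for illustration and is not part of selenium or yolk.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if package_name can be imported by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


# When the virtualenv is active, sys.prefix points inside its directory.
print("interpreter prefix:", sys.prefix)
print("selenium installed:", is_installed("selenium"))
```

If the second line prints False, the package was most likely installed outside the virtualenv that is currently active.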

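Next, you need to know which elements on the Google home page can be operated. selenium can target page elements by ID or by XPath. In the chrome browser, right-click on the page -> Inspect Element to open the Developer Tools window, where the html markup can be inspected.

From the markup you can read the id, class, and so on; for example, the text input box input has id="lst-ib".

You can also right-click the element to view its XPath.

The same method can be used to find the information for the "Google Search" button.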
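To make the difference between locating by id and locating by XPath concrete, here is a small offline sketch. It does not use selenium or a browser; it uses the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The markup is a made-up, heavily simplified stand-in for the search form: only id="lst-ib" and class "jsb" come from this post, and the name and value attributes are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; not the real Google markup.
SAMPLE = """
<form>
  <div class="box"><input id="lst-ib" name="q" type="text" /></div>
  <div class="jsb"><input name="btnK" type="submit" value="Google Search" /></div>
</form>
"""

root = ET.fromstring(SAMPLE)

# Locate by id: any <input> whose id attribute equals "lst-ib".
by_id = root.find(".//input[@id='lst-ib']")

# Locate by structure: the <input> directly under the <div> with class "jsb".
by_path = root.find(".//div[@class='jsb']/input")

print(by_id.get("name"))
print(by_path.get("value"))
```

An id is the more robust choice when the element has one; a structural path is the fallback when it does not, as with the search button here.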

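With the background covered, on to the main topic. The main APIs provided by the selenium client webdriver and the typical workflow are as follows:
1. Initialize the selenium webdriver: browser = webdriver.Firefox()
2. Load a web page: browser.get("URL")
3. Locate a specific element: browser.find_element_by_id() / browser.find_element_by_class_name() ... etc.
4. Trigger an event on the element: send_keys() / click()
5. Get the resulting page: browser.page_source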
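The steps above can be sketched as one small function that receives an already initialized browser object. This is a sketch under assumptions rather than the author's code: the function name run_search is made up, it uses the generic find_element(by, value) call with the string "id" (the value of selenium's By.ID constant) instead of find_element_by_id(), and it submits the form with submit() instead of clicking the button.

```python
def run_search(browser, url, input_id, query):
    """Drive a WebDriver-like browser object through steps 2 to 5."""
    browser.get(url)                             # step 2: load the page
    box = browser.find_element("id", input_id)   # step 3: locate the input box
    box.send_keys(query)                         # step 4: simulate typing
    box.submit()                                 # step 4: submit the enclosing form
    return browser.page_source                   # step 5: read the resulting HTML


# With selenium installed, step 1 and the call would look like this
# (left as comments so the sketch can be read without a browser):
#   from selenium import webdriver
#   browser = webdriver.Firefox()
#   html = run_search(browser, "https://www.google.com/", "lst-ib", "robotexp selenium")
#   browser.quit()
```

Passing the browser in as a parameter keeps the flow independent of which driver is used, and makes it possible to exercise the function with a stand-in object when no browser is available.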

See the Selenium python API for the complete API documentation.

Let's go straight to the demo code!

#!/usr/bin/python

# Web crawler demonstration
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

from time import sleep

# Using firefox browser
browser = webdriver.Firefox()

try:
    # Get a webpage : https://www.google.com/
    browser.get("https://www.google.com/")

    # Pause 1 second so the page has time to load
    sleep(1)

    # Find the input box of Google search
    google_search_input_id = "lst-ib"
    inputBox = browser.find_element_by_id(google_search_input_id)

    # Simulate keyboard input into the input box
    inputBox.send_keys("robotexp selenium")

    # Pause 1 second between actions
    sleep(1)

    # Find the search button: the first <input> inside the element with class "jsb"
    google_search_button_class = "jsb"
    searchButtonClass = browser.find_element_by_class_name(google_search_button_class)
    searchButton = searchButtonClass.find_elements_by_tag_name("input")[0]

    # Click the google search button
    searchButton.click()

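    # Give the results page a moment to load, then get its content
    sleep(1)
    content = browser.page_source
    print(content)

    # The content can now be parsed

    # Close the browser if needed
    #browser.close()
except Exception as e:
    # Printing the exception itself shows the driver's error message
    print(e)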
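The demo stops at the comment about parsing the content. As one possible next step, the sketch below pulls every link out of an HTML string using only the standard library's html.parser. It assumes Python 3; the names LinkCollector and extract_links are made up for illustration, and the sample string is invented and merely stands in for browser.page_source.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all href values found in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Invented sample standing in for browser.page_source.
sample = (
    '<html><body><a href="https://example.com/a">A</a>'
    '<p>text</p><a href="/b">B</a><a>no href</a></body></html>'
)
print(extract_links(sample))
```

In the demo script, the same function could be called as extract_links(content) once the page source has been retrieved.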

Monday, January 5, 2015
