Web-Harvest scraping tutorial: is there a third-party Java package with an HttpWebRequest class?

1. New Horizon College English (2nd Edition), Reading and Writing Book 3: answer key


3.Learn to accept the fact that some people you thought were friends turn out to be enemies. 4.As you would expect from the book's title, there are many references to what kind of man Gates is. 5.The prosperity of the company stems from the hard work and thrift of the entire staff.
6.He said nothing at all on the subject of the play which was put on for the first time Saturday night. XII.
1.至於那天晚上他是怎麼死的，事實上我無法解釋，而且也許不會有任何可能的解釋了。 2.做了一件事然後說自己本來不想那樣做是沒有用的；如果你不想做，你就不會做了。 3.微軟公司正在研究降低其產品成本的方法，以便發展中國家的人也能買得起。 4.蘋果公司也願意將其部分軟體與微軟公司的產品捆綁在一起，以促進其銷售。
5.與評價父親不同，人們評價母親依據的是其為母之道的成功或失敗。對於母親來說，一切都取決於孩子最終成為什麼樣的人。
6.人們會發現這個網站很有價值，因為我們投入了大量時間准備網站的信息。 Cloze XIII.
1.A 2.B 3.C 4.B 5.A 6.D 7.B 8.D 9.C 10.A 11.D 12.B 13.D 14.C 15.A 16.C 17.D 18.C  19.A  20.D
Section B
Comprehension of the Text II.
1.B 2.D 3.A 4.C 5.D 6.A 7.D 8.C Vocabulary III.
1.abolished  2.bribing  3.arrested  4.propose  5.vote  6.amend  7.regulating 8.discriminate  9.reverse  10.witness
IV.
1.with  2.to  3.at  4.for  5.in  6.in  7.aside  8.as  9.on  10.of Unit 8 Section A Vocabulary
III.
1. mount  2. resembles  3. implication   4. prohibits   5. deliberate 6. debate  7. classified   8. guidelines   9. split  10. generated Exercises on Web course only:
11. categories  12. breed  13. commission  14. draft  15. confusion IV.
1. within reach   2. fall into  3. in terms of  4. get around  5. regardless of  6. referred to  7. What if  8. in the first place  9. concerned about  10. identical to Exercises on Web course only:
11. in the wake of     12. comparable to  13. puzzling over V.
1. K   2. E  3. M  4. O  5. F  6. H  7. N  8. A   9. I  10. B Collocation
VI.
1. ties  2. emotions  3. interests  4. experience  5. responsibility  6. love  7. characteristics  8. memories  9. information  10. belief

Word Building
VII.
1. transposition  2. transatlantic  3. transmigrants  4. transformed 5. transnational  6. transoceanic  7. transshipped   8. transported VIII.
1. nonexistent  2. non-stop  3. non-art  4. non-college
5. non-productive  6. non-profit   7. non-fiction   8. non-violent Sentence Structure IX.
1.What if I say no
2.What if they don't know
3.What if we can't finish it on time 4.What if this happens to us someday 5.What if he has lied to us
X.
1. The Bosnian peace talks are continuing in Geneva today with the new proposals at the top of the agenda. 2. All of Southern Africa is suffering from a severe drought with Mozambique and Zimbabwe among the worst-hit countries.
3. The Europe Summit in Paris is drawing to an end with the US in danger of being completely isolated. 4. With the King in prison, the chief commander came to power and ruled the country. 5. With stability itself under threat, the reforms deserve all the support they can get.
Translation XI.
It sounds like a good idea, but what if it's a trick?
Cities and towns in this area suffered a lot from the earthquake with Jiujiang and Ruichang among the worst-hit. He complained that they should not have got involved in it in the first place. For Mary's sake, I can lend you my car to get around your transport problem.
In theory it』s feasible to clone a child to harvest organs, but in practice it would be psychologically harmful to the child. He published an article under the name of Braver which stresses the idea that the process of cloning animals would work for humans as well.
XII.
你說你不會把時間浪費在約會上，但如果遇到吸引你的男子，你會怎麼辦呢？
為了幫助艾滋病患者，需要有新的措施，地方社團、非政府機構、政府和國際組織之間要建立密切的合作關系。上周，該國際傳出消息說，他們正密切關注該地區的情況。
在導致數百人死亡的污染事件發生之後，政府開始起草環境保護指導方針。
正如這篇文章的作者所警告的，克隆人類可能是一件使人更加悲傷而非更加高興的事。在一些西方國家，有些父母准備克隆孩子，目的是進行非致命器官的移植。 Cloze XIII.
1. A  2. B  3. D  4. B  5. C  6. A  7. C  8. C  9. A   10. C  11. C  12. B  13. B  14. C  15. A  16. B  17. D  18. A  19. D  20. B
Section B
Reading Skills I.

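2. How does Web Scraper scrape data from other links on a page?

First create a selector of type Link, select the links, and tick Multiple. Then open one of the links and, on the new page, set up the elements you want to scrape. For a detailed Web Scraper walkthrough, search for 「Webscraper實戰教學」 on NetEase Cloud Classroom (網易雲課堂); it covers second-level page navigation and click interactions step by step.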

3. How to use Web-Harvest

Applications of Web-Harvest

I. Background

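In today's era of unprecedented information explosion, people no longer worry about a shortage of information; instead they pay a heavy price to filter out what is useful. So how do we collect useful information? Services such as RSS and blogs exist, but they cannot fully meet our needs, because much information is not published as structured data. Engineers therefore came up with precise (targeted) search, which gave rise to a large number of vertical search sites (Kuxun, for example) that were very popular for a while. We cannot know how they implemented it, but we can build this kind of precise collection ourselves. The open-source Web-Harvest is a similar technology. I have worked with it before, so I am writing this up to share.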

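II. Introduction to Web-Harvest

Web-Harvest is an open-source web data extraction tool written in Java. It provides a way to extract useful data from the pages you need. To do so, you may need technologies for manipulating text/XML such as XSLT, XQuery, and regular expressions. Web-Harvest focuses mainly on HTML/XML-based page content, which still makes up the majority of the web. It can also be extended easily by writing your own Java methods to broaden its extraction capabilities.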

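The main goal of Web-Harvest is to make better use of existing data extraction technologies. Its aim is not to create a new method but to provide a better way to use and combine existing ones. It offers a set of processors for handling data and controlling flow; each processor can be seen as a function that takes parameters and returns a result after execution. Processors are combined into a pipeline so that they execute as a chain. To make data manipulation and reuse easier, Web-Harvest also provides a variable context for storing declared variables.

The result of this process can be stored in files created during execution or used in the context of the calling program.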
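The processor-and-pipeline idea described above can be sketched in plain Java. This is only an illustration of the concept, not Web-Harvest's internal API: the class `PipelineSketch`, its methods `runPipeline` and `defineVar`, and the three string operations standing in for real processors are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustration only: these names are hypothetical and are not Web-Harvest classes.
class PipelineSketch {

    // Variable context: named values that later processors can reuse.
    static final Map<String, String> context = new HashMap<>();

    // Run the processors in order, feeding each result into the next one.
    static String runPipeline(String input, List<UnaryOperator<String>> processors) {
        String value = input;
        for (UnaryOperator<String> processor : processors) {
            value = processor.apply(value);
        }
        return value;
    }

    // Store a result under a name, similar in spirit to a <var-def> element.
    static String defineVar(String name, String value) {
        context.put(name, value);
        return value;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> processors = List.of(
                String::trim,                      // stand-in for a "clean" step
                s -> s.replaceAll("\\s+", " "),    // stand-in for a "normalize" step
                String::toUpperCase);              // stand-in for a "transform" step
        String result = defineVar("start", runPipeline("  web   harvest ", processors));
        System.out.println(result);
        System.out.println(context.get("start"));
    }
}
```

Each processor only sees the output of the previous one, which is what lets steps be recombined without touching each other.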

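I. Configuration language

Each extraction process is defined in one or more XML-based configuration files and is described by specific, structured XML elements. The following configuration file illustrates this: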

<config charset="gbk">
    <!-- Start crawling: search for the keyword "玩具" (toys) -->
    <var-def name="start">
        <html-to-xml>
            <http url="玩具"/>
        </html-to-xml>
    </var-def>
    <!-- Get the list of company websites from the paid-ranking results -->
    <var-def name="urlList">
        <xpath expression="//div[@class='r']">
            <var name="start"/>
        </xpath>
    </var-def>
    <!-- Loop over urlList and write the results to an XML file -->
    <file action="write" path="/catalog.xml" charset="utf-8">
        <![CDATA[ <catalog> ]]>
        <loop item="item" index="i">
            <list><var name="urlList"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                        declare variable $item as node() external;
                        let $name := data($item//span/font[1]/text()[1])
                        let $url := data($item//span/font[2]/text())
                        return
                            <website>
                                <name>{normalize-space($name)}</name>
                                <url>{normalize-space($url)}</url>
                            </website>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </catalog> ]]>
    </file>
</config>

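The configuration above consists of three parts.

Steps in the first part:

1. Download the content at the given URL;

2. Clean the HTML of the downloaded content to produce XHTML;

3. Save the result in a new variable, "start";

Steps in the second part:

1. Use an XPath expression to extract the search results from the downloaded page;

2. Save those search results in a new variable, "urlList";

The third part uses the search results from the previous part to extract the relevant information:

1. Iterate over every item in a loop;

2. Get the name and url of each item;

3. Save them to the file system;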
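The second and third parts can be approximated with the XPath support built into the JDK. This is a minimal sketch, not Web-Harvest itself. It assumes the page has already been downloaded and cleaned into well-formed XHTML, so the first part is skipped, and the class name `XPathExtractSketch`, the sample markup, and the `name|url` output format are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: mirrors the config's XPath and XQuery steps with JDK classes only.
class XPathExtractSketch {

    // Returns one "name|url" string per <div class="r"> found in already-cleaned XHTML.
    static List<String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Part 2: select the result blocks (the role played by "urlList").
            NodeList items = (NodeList) xpath.evaluate(
                    "//div[@class='r']", doc, XPathConstants.NODESET);

            // Part 3: read name and url relative to each item.
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                String name = xpath.evaluate("normalize-space(.//span/font[1]/text()[1])", item);
                String url = xpath.evaluate("normalize-space(.//span/font[2])", item);
                rows.add(name + "|" + url);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample markup with the same shape the config expects.
        String xhtml = "<html><body>"
                + "<div class='r'><span><font> Toy Shop </font><font>toys.example</font></span></div>"
                + "<div class='r'><span><font>Kids Store</font><font>kids.example</font></span></div>"
                + "</body></html>";
        extract(xhtml).forEach(System.out::println);
    }
}
```

Real pages are rarely well-formed XML, which is why the configuration runs html-to-xml before any XPath is applied.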

With the configuration file in place (save it as an .xml file), we go one step further and write a few lines of code:

import java.io.IOException;
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class Test {

    public static void main(String[] args) throws IOException {
        // Load the XML configuration and set the working directory
        ScraperConfiguration config = new ScraperConfiguration("c:/.xml");
        Scraper scraper = new Scraper(config, "c:/tmp/");
        scraper.setDebug(true);

        // Run the scrape and print the elapsed time in milliseconds
        long startTime = System.currentTimeMillis();
        scraper.execute();
        System.out.println("time elapsed: " + (System.currentTimeMillis() - startTime));
    }
}

Let's run it and look at the result:

<catalog>
    <website>
        <name>上海麗強專業大型</name>
        <url></url>
    </website>
    <website>
        <name>多樣型大型</name>
        <url></url>
    </website>
    <website>
        <name>童博士卡通</name>
        <url></url>
    </website>
    <website>
        <name>芝麻街</name>
        <url>c4</url>
    </website>
    <website>
        <name>童博士, 中國平價學生用品..</name>
        <url></url>
    </website>
    <website>
        <name>充氣</name>
        <url></url>
    </website>
    <website>
        <name>找木製</name>
        <url></url>
    </website>
    <website>
        <name>米多迪</name>
        <url>b14</url>
    </website>
</catalog>

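Pretty cool, isn't it? Writing a crawler is that simple.

II. Going deeper

Do the configuration, code, and results above look familiar? They resemble the way Java reads database data through iBATIS.

So could we build a similar mechanism, treating the whole Internet as one huge database that we can read from at will?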

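Web-Harvest provides a ScraperContext in which you can set Java objects and then collect result data through them (for example, set a Map and collect data through it).

Scraper provides a method for this:

scraper.getContext().put("resDataSet", new ResultDataSet());

ResultDataSet is the Java object that collects the data.

So we can proceed as follows:

a) First, set the path of the page to visit

scraper.getContext().put("startPageHref", ";wd=兒童玩具");

b) Second, set the container that collects the returned data

scraper.getContext().put("resDataSet", new ResultDataSet());

c) In the configuration file, data can then be recorded like this

${resDataSet.addRecord("searchResult","totalSearchResult",totalSearchResult)};

d) After the crawl finishes, retrieve the data:

ResultDataSet resultDataSet = (ResultDataSet)scraper.getContext().get("resDataSet");

OK, now we can use the data however we like; see the attachment for details.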
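The text does not show ResultDataSet itself, so the following is only a guess at what such a collector could look like. The three-argument `addRecord` matches the call used in the configuration, but the meaning assigned to the arguments (data-set name, field name, value), the `getValues` accessor, and the plain `Map` standing in for the scraper context are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the author's ResultDataSet; the real class is not shown in the text.
class ResultDataSet {

    // data-set name -> field name -> values, in insertion order
    private final Map<String, Map<String, List<Object>>> data = new LinkedHashMap<>();

    // Assumed meaning of the arguments: (data-set name, field name, value).
    public void addRecord(String dataSet, String field, Object value) {
        data.computeIfAbsent(dataSet, d -> new LinkedHashMap<>())
            .computeIfAbsent(field, f -> new ArrayList<>())
            .add(value);
    }

    // Read back everything collected for one field; empty if nothing was recorded.
    public List<Object> getValues(String dataSet, String field) {
        Map<String, List<Object>> fields = data.getOrDefault(dataSet, Collections.emptyMap());
        return Collections.unmodifiableList(fields.getOrDefault(field, Collections.emptyList()));
    }

    public static void main(String[] args) {
        // A plain Map plays the role of the scraper context in this sketch.
        Map<String, Object> context = new HashMap<>();
        context.put("resDataSet", new ResultDataSet());

        // During the crawl, the configuration would call addRecord on the shared object.
        ResultDataSet collector = (ResultDataSet) context.get("resDataSet");
        collector.addRecord("searchResult", "totalSearchResult", 128);

        // After the crawl, the caller reads the same object back from the context.
        ResultDataSet result = (ResultDataSet) context.get("resDataSet");
        System.out.println(result.getValues("searchResult", "totalSearchResult"));
    }
}
```

The point of the pattern is that the caller and the configuration share one mutable object, so nothing has to be parsed back out of an output file.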

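III. Handling pagination

a) Background

Sites hold a great deal of information, so they present it in pages.

b) Implementation

How do we handle this? To extract paged data we need to be clear about a few things:

1. The paginator: page number, page size, record count or page count.

2. How the "next page" URL is constructed

3. Scraping the data on each page

Every site paginates differently, so how do we cope? Hard-coding is not an option; instead we handle it through Web-Harvest configuration files.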

Web-Harvest's own configuration file has this structure:

<config charset="gbk">
    configuration details
</config>

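We extend this structure:

<web-harvest-config>
    <!-- Configuration that generates the paginator -->
    <config charset="gbk" id="pagination">
        configuration details
    </config>
    <!-- Assemble the next-page URL -->
    <config charset="gbk" id="urlnav">
        configuration details
    </config>
    <!-- Scrape the list data -->
    <config charset="gbk" id="listData">
        configuration details
    </config>
</web-harvest-config>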

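We can then handle pagination with three config entries:

- Step 1: generate the paginator from the config with id="pagination"

- Step 2: combine the generated paginator with the config with id="urlnav" to build the next-page URL

- Step 3: extract the required data with the config with id="listData"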
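The arithmetic behind the first two steps can be sketched in a few lines of Java. This is an assumption-laden illustration rather than the author's configuration: the class `PaginationSketch`, the `%d` placeholder convention for the page number, and the example.com URL are all made up, and real sites may paginate by offset or by opaque tokens instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the paginator (step 1) and next-page URL construction (step 2).
class PaginationSketch {

    // Step 1: total pages = ceil(totalRecords / pageSize), using integer arithmetic.
    static int pageCount(int totalRecords, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (totalRecords + pageSize - 1) / pageSize;
    }

    // Step 2: build one URL per page. The "%d" placeholder is an assumed URL scheme.
    static List<String> pageUrls(String urlTemplate, int totalRecords, int pageSize) {
        List<String> urls = new ArrayList<>();
        int pages = pageCount(totalRecords, pageSize);
        for (int page = 1; page <= pages; page++) {
            urls.add(String.format(urlTemplate, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Step 3 would fetch each URL and extract the list data from it.
        for (String url : pageUrls("http://example.com/list?page=%d", 45, 20)) {
            System.out.println(url);
        }
    }
}
```

With 45 records and a page size of 20, `pageCount` yields 3, so three URLs are produced.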

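I. Pros and cons of Web-Harvest

Pros:

- Web-Harvest is a fairly convenient API library for scraping information, currently at version 1.0

- Good extensibility: only the configuration file needs to change

- Quick to pick up and easy to use.

Cons:

- There are many processing steps, so it is correspondingly slow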

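II. Other approaches to precise data scraping that I have used or am trying

a) Using HTMLParser

HTMLParser can parse the tags in HTML source (such as TABLE and DIV) and also lets you define your own tags (for example, ENET); you extract data by looking up specific tags. With few intermediate processing steps it is fast; the drawback is a lot of hard-coding, which makes it hard to extend. Perhaps a specific expression could be found that extracts the data quickly.

b) Using HtmlCleaner

This approach still follows the HTML->XML route: first HtmlCleaner converts the fetched page into XML, then XPath, XSL, and similar tools transform the XML to collect the data. Web-Harvest works in a similar way, but we can streamline the process to improve scraping efficiency.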

III. Problems encountered when using crawlers

a) Websites impose IP restrictions on crawlers that fetch data too frequently

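4. Is there a third-party Java package with an HttpWebRequest class?

That class comes from .NET, doesn't it?

The JDK has HttpURLConnection, which is limited in functionality.

Apache HttpComponents provides similar functionality (HttpClient, HttpAsyncClient).
https://hc.apache.org/index.html

There is also the asynchronous AsyncHttpClient: https://github.com/AsyncHttpClient/async-http-client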
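For the JDK option, a minimal GET with HttpURLConnection might look like the sketch below. To keep it runnable without network access, it fetches from a throwaway local server started with the JDK's `com.sun.net.httpserver.HttpServer`; the class name `HttpGetSketch`, the `/hello` path, and the 5-second timeouts are arbitrary choices for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; only JDK APIs are used.
class HttpGetSketch {

    // Fetch a URL with HttpURLConnection and return the response body as UTF-8 text.
    static String get(String address) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Start a local server on a free port, fetch from it once, then stop it.
    static String demo() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/hello", exchange -> {
                byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            try {
                return get("http://127.0.0.1:" + server.getAddress().getPort() + "/hello");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Connection pooling, retries, and asynchronous requests are where the third-party clients listed above tend to be more convenient.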
