HtmlUnit and local files: Java crawler questions, including one about Sina Weibo

『1』 How to use HtmlUnit to grab the Flash files embedded in a web page

import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

public class DownloadFile {
	public static void main(String[] args) throws Exception {
		String baseUrl = "http://hanyu.iciba.com/hanzi/1.shtml";
		// Stroke-order animation: the value of <param name="movie"> inside the "guanggao" block.
		String bihuaRegex = "class=\"guanggao\"[^<]*<[^<]*<param\\s*name=\"movie\"\\s*value=\"([^\"]*)";
		// Pronunciation audio for ā and ē: the f= entry of the FlashVars parameter.
		String aSoundRegex = "class=\"js12\">ā.*?name=\"FlashVars\"\\s*value=\"f=([^\"]*)";
		String eSoundRegex = "class=\"js12\">ē.*?name=\"FlashVars\"\\s*value=\"f=([^\"]*)";

		WebClient client = new WebClient();
		// Only the raw HTML is needed, so skip CSS and JavaScript processing.
		client.getOptions().setCssEnabled(false);
		client.getOptions().setJavaScriptEnabled(false);
		// The names of the next two option setters were lost in the original post;
		// these two are an assumed reconstruction.
		client.getOptions().setThrowExceptionOnFailingStatusCode(false);
		client.getOptions().setThrowExceptionOnScriptError(false);

		Page page = client.getPage(baseUrl);
		String source = page.getWebResponse().getContentAsString();
		Matcher mBihua = match(source, bihuaRegex);
		Matcher mA = match(source, aSoundRegex);
		Matcher mE = match(source, eSoundRegex);

		// Each loop writes to a fixed file name, so a later match overwrites an earlier one.
		while (mBihua.find()) {
			// The movie path is relative, so prepend the site root.
			String url = "http://hanyu.iciba.com/" + mBihua.group(1);
			page = client.getPage(url);
			saveFile(page, "d:/testDownload/bihua.swf");
		}
		while (mA.find()) {
			String url = mA.group(1);
			page = client.getPage(url);
			saveFile(page, "d:/testDownload/a.mp3");
		}
		while (mE.find()) {
			String url = mE.group(1);
			page = client.getPage(url);
			saveFile(page, "d:/testDownload/e.mp3");
		}
		client.close();
	}

	// DOTALL lets ".*?" span line breaks in the page source.
	public static Matcher match(String source, String regex) {
		Pattern p = Pattern.compile(regex, Pattern.DOTALL);
		return p.matcher(source);
	}

	// Streams the response body to a local file; the target directory must already exist.
	public static void saveFile(Page page, String file) throws Exception {
		try (InputStream is = page.getWebResponse().getContentAsStream();
				FileOutputStream output = new FileOutputStream(file)) {
			IOUtils.copy(is, output);
		}
	}
}

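Note: the attachment contains only the downloaded files, not code. All of the code is what is posted above. If you only want to know the method, you do not need to download the attachment.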
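The extraction step can be tried without HtmlUnit or network access. The following is a minimal sketch, assuming a made-up HTML snippet rather than the real page; the class name FlashUrlExtractor, the method extractMovieUrls, and the simplified pattern are illustrative and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlashUrlExtractor {
    // Simplified variant of bihuaRegex: capture the value of <param name="movie">.
    private static final Pattern MOVIE_PARAM = Pattern.compile(
            "<param\\s+name=\"movie\"\\s+value=\"([^\"]*)\"",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns every movie URL in the HTML, prefixing baseUrl when the value is relative. */
    public static List<String> extractMovieUrls(String html, String baseUrl) {
        List<String> urls = new ArrayList<>();
        Matcher m = MOVIE_PARAM.matcher(html);
        while (m.find()) {
            String value = m.group(1);
            urls.add(value.startsWith("http") ? value : baseUrl + value);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical snippet, not copied from the real page.
        String html = "<div class=\"guanggao\"><object>"
                + "<param name=\"movie\" value=\"swf/demo.swf\"></object></div>";
        System.out.println(extractMovieUrls(html, "http://hanyu.iciba.com/"));
    }
}
```

Collecting the matches into a list first makes it easy to give each download its own file name instead of overwriting a single fixed path.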

『2』 Traditional Chinese content on a GB2312-encoded page turns into garbled text; how to fix it

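Step 1: Download the HtmlUnit source code. The directory com\gargoylesoftware\htmlunit\util contains the file EncodingSniffer, which holds the logic for detecting a page's encoding. At around line 626 (the exact line depends on the version), after encoding = encoding.toUpperCase(Locale.ROOT); add if(encoding.equals("GB2312"))encoding="GBK";

Step 2: At around line 715, after charset = charset.toUpperCase(Locale.ROOT); add if(charset.equals("GB2312"))charset="GBK";

Why it works: GB2312 covers a relatively small character set, while GBK is a larger, backward-compatible superset, so content labeled GB2312 can be decoded as GBK directly. HtmlUnit calls the EncodingSniffer class whenever it fetches a page, so the change maps every detected GB2312 to GBK.

The tedious part is that you have to download the HtmlUnit source, compile it, and then overwrite the EncodingSniffer.class inside the jar that Maven references with the newly built one.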
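The effect of the GB2312-to-GBK substitution can be seen with the JDK alone. This is a small sketch, assuming a JDK that ships both the GBK and GB2312 charsets; the class CharsetDemo and the method decode are illustrative names. It encodes one traditional character that exists in GBK but not in GB2312, then decodes the same bytes both ways.

```java
import java.nio.charset.Charset;

public class CharsetDemo {
    /** Decodes raw bytes using the named charset. */
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+9AD4 is the traditional character 體, which GB2312 does not contain.
        String traditional = "\u9AD4";
        byte[] bytes = traditional.getBytes(Charset.forName("GBK"));

        // Decoding as GBK recovers the character.
        System.out.println("GBK:    " + decode(bytes, "GBK"));
        // Decoding as GB2312 cannot map these bytes, so the result is garbled.
        System.out.println("GB2312: " + decode(bytes, "GB2312"));
    }
}
```

This is why a page that declares GB2312 but actually contains traditional characters reads correctly once the declared charset is treated as GBK.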

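『3』 How to scrape data that a page loads dynamically with JavaScript

The code is fairly simple and can be read directly. One thing to note: the simulated browser needs time to run the query, so the main thread should sleep for a while during the query to make sure the HtmlUnit browser has finished loading the data.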

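import java.util.concurrent.TimeUnit;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableBody;

public class Entrance {

	public static void main(String[] args) throws Exception {
		String webUrl = "http://www.xy2046.com/xypk10.aspx?T=234&day=2016-05-29";
		HtmlPage page = getHtmlPage(webUrl);
		// The result table is looked up by its id, "mytable".
		final HtmlTable table = (HtmlTable) page.getElementById("mytable");
		HtmlTableBody tbody = table.getBodies().get(0);
		printTable(tbody);
		System.err.println("Query succeeded");
	}

	// The post did not include the two helpers below. These bodies are a
	// reconstruction based on the imports and the text, not the author's code.
	private static HtmlPage getHtmlPage(String url) throws Exception {
		WebClient client = new WebClient(BrowserVersion.CHROME);
		client.getOptions().setThrowExceptionOnScriptError(false);
		HtmlPage page = client.getPage(url);
		// Sleep so the page's JavaScript can finish; the duration is a guess.
		TimeUnit.SECONDS.sleep(3);
		return page;
	}

	private static void printTable(HtmlTableBody tbody) {
		DomNodeList<HtmlElement> rows = tbody.getElementsByTagName("tr");
		for (HtmlElement row : rows) {
			System.out.println(row.getTextContent());
		}
	}
}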
The answer can be found on CSDN.
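A fixed sleep is either too short on a slow day or wastes time on a fast one. One alternative is to poll for the condition you actually care about, with a timeout. The helper below is a sketch of that idea using only the JDK; the class WaitUntil and the method waitUntil are illustrative names, not HtmlUnit API. HtmlUnit itself also offers WebClient.waitForBackgroundJavaScript(long timeoutMillis), which may be enough on its own.

```java
import java.util.function.BooleanSupplier;

public class WaitUntil {
    /**
     * Polls the condition every pollMillis until it holds or timeoutMillis elapses.
     * Returns true if the condition became true in time, false otherwise.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition. With HtmlUnit it might instead be
        // () -> page.getElementById("mytable") != null
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println("ready = " + ready);
    }
}
```

The caller decides what "ready" means, so the same helper works whether the page fills a table, removes a spinner, or sets a flag.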

『4』 A Java crawler question about Sina Weibo. Thanks!

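1. Every Java class must be loaded into the JVM before it can run. Loading is done by the JVM's class loaders, whose job is essentially to read class files from disk into memory.
2. Java classes fall roughly into three kinds:
   1. system classes
   2. extension classes
   3. classes written by the programmer

3. There are two ways a class gets loaded:
   1. Implicit loading: when the running program creates an object with new or a similar construct, the class loader is invoked implicitly to load the corresponding class into the JVM.
   2. Explicit loading: the required class is loaded explicitly through Class.forName() or similar methods.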
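The two loading styles in the list above can be compared directly. This is a minimal sketch using only the JDK; the class ClassLoadingDemo and the method loadExplicitly are illustrative names.

```java
import java.util.ArrayList;

public class ClassLoadingDemo {
    /** Explicit loading: look the class up by its fully qualified name. */
    public static Class<?> loadExplicitly(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("No such class: " + className, e);
        }
    }

    public static void main(String[] args) {
        // Implicit loading: the JVM loads ArrayList when this expression first needs it.
        Object list = new ArrayList<String>();

        // Explicit loading of the same class yields the same Class object.
        Class<?> explicit = loadExplicitly("java.util.ArrayList");
        System.out.println(explicit == list.getClass());

        // System classes come from the bootstrap loader, which Java reports as null.
        System.out.println(String.class.getClassLoader());
        // Programmer-defined classes are loaded by a regular ClassLoader instance.
        System.out.println(ClassLoadingDemo.class.getClassLoader());
    }
}
```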

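You have probably already searched online, but I suspect the specific mechanism is an internal secret. After all, if it could be found online, you can imagine what the result would be.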
