Fixing garbled Chinese text when fetching web pages in Java

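When a Java program fetches a page from a URL, Chinese text sometimes comes out garbled. The usual cause is a charset mismatch. Most web pages are stored as UTF-8, and when a page is fetched over the network it arrives locally as a stream of UTF-8-encoded bytes, so those bytes have to be decoded back into text as UTF-8. If you do not specify a charset (Java then falls back to the platform default) or you decode with a different charset, the Chinese text turns into garbage.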
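To make the mismatch concrete, here is a minimal sketch that uses only the JDK. The class name `CharsetMismatchDemo` and the helper `decode` are made up for illustration; they are not part of the original code. The sample string is written with Unicode escapes (it is the two characters of "Chinese text") so the demo does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    // Hypothetical helper: decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";

        // What travels over the network: the UTF-8 bytes of the text.
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the same charset restores the text.
        String right = decode(wire, StandardCharsets.UTF_8);

        // Decoding with a different charset yields one wrong character per byte.
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wire.length + " bytes, " + wrong.length() + " garbled chars");
    }
}
```

The bytes are the same in both cases; only the charset used for decoding decides whether the result is readable.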

Here is the code I originally wrote:

    // Fetch the source of the page at the given URL
    public String getdata(String url) {
        String data = null;
        org.apache.commons.httpclient.HttpClient client = new HttpClient();
        GetMethod getMethod = new GetMethod(url);
        // The header name is "User-Agent" (hyphen, not underscore)
        getMethod.setRequestHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:39.0) Gecko/20100101 Firefox/39.0");
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler()); // the default retry policy
        try {
            int statusCode = client.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.out.println("Wrong");
            }
            byte[] responseBody = getMethod.getResponseBody();
            // Problem line: no charset given, so the platform default is used
            data = new String(responseBody);
            return data;
        } catch (HttpException e) {
            System.out.println("Please check your provided http address!");
            data = "";
            e.printStackTrace();
        } catch (IOException e) {
            data = "";
            e.printStackTrace();
        } finally {
            getMethod.releaseConnection();
        }
        return data;
    }

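Note the line data = new String(responseBody); (highlighted in red in the original post). Written this way, all Chinese text comes out garbled when the program runs. The printed output is shown in the figure below: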

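Change the code to decode with UTF-8: data = new String(responseBody, "UTF-8");

Chinese text now displays correctly. The complete code follows; the only change is the line that builds the string from responseBody: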

    // Fetch the page source
    public String getdata(String url) {
        String data = null;
        org.apache.commons.httpclient.HttpClient client = new HttpClient();
        GetMethod getMethod = new GetMethod(url);
        // The header name is "User-Agent" (hyphen, not underscore)
        getMethod.setRequestHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:39.0) Gecko/20100101 Firefox/39.0");
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler()); // the default retry policy
        try {
            int statusCode = client.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.out.println("Wrong");
            }
            byte[] responseBody = getMethod.getResponseBody();
            // Fixed line: decode the bytes explicitly as UTF-8
            data = new String(responseBody, "UTF-8");
            return data;
        } catch (HttpException e) {
            System.out.println("Please check your provided http address!");
            data = "";
            e.printStackTrace();
        } catch (IOException e) {
            data = "";
            e.printStackTrace();
        } finally {
            getMethod.releaseConnection();
        }
        return data;
    }

After running the code, the printed output is as shown in the figure below:

Problem solved.
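One caveat to the fix above: hard-coding UTF-8 only works for pages that really are UTF-8, and some pages declare another charset (GBK, for example) in their Content-Type header. The sketch below is one way to pick the charset from that header value and fall back to UTF-8. The class `ResponseCharset` and its methods are hypothetical helpers written for this illustration, not part of the original code or of Commons HttpClient, and the parsing is deliberately simple. Commons HttpClient 3.x also has getResponseCharSet() on the method object for reading the declared charset, though as far as I know it falls back to ISO-8859-1 rather than UTF-8 when none is declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ResponseCharset {

    // Hypothetical helper: read the charset from a Content-Type header value
    // such as "text/html; charset=GBK". Falls back to UTF-8 when the header
    // is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // malformed or unsupported charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decode a response body using the charset declared in the header value.
    static String decode(byte[] body, String contentType) {
        return new String(body, fromContentType(contentType));
    }

    public static void main(String[] args) {
        // Assumes the running JDK ships the GBK charset (standard JDK builds do).
        byte[] gbkBody = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        System.out.println(decode(gbkBody, "text/html; charset=GBK").equals("\u4e2d\u6587"));
    }
}
```

In getdata, the header value could be taken from the response and passed to decode together with responseBody in place of the hard-coded "UTF-8". Pages that declare their charset only in an HTML meta tag would still need separate handling.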
