Thursday, February 24, 2011

Download web content from a Java program



Downloading web content from a Java program is fairly easy. Here is one more example, this time with proxy settings.

I am using httpunit.jar (HttpUnit, the com.meterware.httpunit package) together with the jars it depends on.

import java.io.IOException;

import org.xml.sax.SAXException;

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebRequest;
import com.meterware.httpunit.WebResponse;

public class Start {

    public static void main(String[] args) throws IOException, SAXException {
        // A WebConversation plays the role of the browser session.
        WebConversation conversation = new WebConversation();
        // Proxy host, port, user name (with Windows domain) and password.
        conversation.setProxyServer("myproxy.proxy.com", 9087, "domain\\username", "password");

        // Build a GET request and send it through the proxy.
        WebRequest request = new GetMethodWebRequest("http://www.google.com");
        WebResponse response = conversation.getResponse(request);

        // Print the page source.
        System.out.println(response.getText());
    }
}

The output is the HTML source of Google's home page.
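If you would rather not pull in extra jars, roughly the same download can be sketched with the JDK's own java.net classes. This is only a sketch under a few assumptions: the proxy host, port and credentials are the same placeholders as in the HttpUnit example, the class and method names (ProxyDownload, buildProxy, fetch) are made up for illustration, and the proxy is assumed to accept an authentication scheme that the JDK's Authenticator can answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class ProxyDownload {

    // Describe an HTTP proxy; the host name is not resolved until a connection is made.
    static Proxy buildProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Fetch a URL through the given proxy and return the response body as text.
    static String fetch(String url, Proxy proxy) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection(proxy);
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            connection.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder credentials (assumption): supplied when the proxy asks for them.
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("domain\\username", "password".toCharArray());
            }
        });
        Proxy proxy = buildProxy("myproxy.proxy.com", 9087);
        System.out.println(fetch("http://www.google.com", proxy));
    }
}
```

HttpUnit is still the more convenient choice if you also want to work with forms and links on the page; the JDK version only gives you the raw text.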

Thanks for reading. Bye !!

How to use wget behind a proxy


wget is a powerful yet handy tool for crawling a small part of the web. It makes restricted crawling much easier.

Using wget behind a proxy can be a bit tricky. Here are some sample commands.

On Windows:
cmd> wget www.google.com
......................
....................
... failed: Connection refused.

If you are getting the same error, try:

cmd> wget -U "Firefox" www.google.com

Sometimes the proxy only checks the user agent. You can try to get past that check with the -U (--user-agent) option. If this does not work either, try:

cmd> set http_proxy=http://myproxy.proxy.com:9087
cmd> wget --proxy=on www.google.com

You should now be able to download index.html. If it still fails because the proxy requires a username and password, try:

cmd> set http_proxy=http://myproxy.proxy.com:9087
cmd> wget --proxy=on --proxy-user="domain\foo" --proxy-password="chocolate" www.google.com

If this still does not work, then sorry, you are on your own!

Hope this is helpful. To find your proxy settings, check Internet Options or simply google it.
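Once you know the proxy host and port, the same http_proxy value can also be reused from Java. The sketch below is a hypothetical helper (ProxySetting and parse are names invented here), and it assumes the value looks like myproxy.proxy.com, optionally with an http:// prefix and a :port suffix. Falling back to port 80 when no port is given is also an assumption of this sketch.

```java
import java.net.URI;

class ProxySetting {
    final String host;
    final int port;

    ProxySetting(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Parse values such as "http://myproxy.proxy.com:9087" or "myproxy.proxy.com".
    static ProxySetting parse(String value) {
        // Add a scheme when it is missing so that URI can pick out host and port.
        String normalized = value.contains("://") ? value : "http://" + value;
        URI uri = URI.create(normalized);
        // Assumption: use port 80 when the value does not name a port.
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        return new ProxySetting(uri.getHost(), port);
    }

    public static void main(String[] args) {
        String value = System.getenv("http_proxy");
        if (value == null) {
            System.out.println("http_proxy is not set");
            return;
        }
        ProxySetting setting = parse(value);
        System.out.println(setting.host + ":" + setting.port);
    }
}
```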