
Sunday, January 23, 2011

Downloading web pages using ubuntu, cron, wget and curl

Here is a short tutorial for setting up an automated, scheduled download of web pages using free and open source software. For this tutorial, I will be automating the daily download of the foreign exchange rates from the Bank Negara Malaysia (Malaysia's central bank) website. This is a good way for researchers and analysts to automate the task of collating daily, weekly or monthly data.

The free and open source software :
  1. Ubuntu Linux 10.04 : http://www.ubuntu.com/ (a very popular and easy-to-use operating system which is an alternative to MS Windows)
  2. Cron : https://help.ubuntu.com/community/CronHowto (this software is part of the Ubuntu Linux system and is used to run scheduled tasks automatically)
  3. Wget : http://manpages.ubuntu.com/manpages/hardy/man1/wget.1.html (this software is also part of the Ubuntu Linux system and is a command-line downloader of files)
  4. Curl : http://curl.haxx.se/ (Curl is an optional package on the Ubuntu Linux system and is a command-line downloader/uploader of files)
Downloading and installing Ubuntu Linux
  1. Get a computer to use for this purpose; a second-hand five-year-old computer with 512MB RAM and a 20GB HDD should be good enough. But I would recommend a new computer with 2GB RAM if you wish to use other software to analyse the data. When buying the computer, you do not need to buy MS Windows (this will usually save you RM300) or any other software.
  2. Download Ubuntu Linux and follow the installation instructions from the website : http://www.ubuntu.com/desktop/get-ubuntu/download . If you need any help, you can contact other Malaysian users through http://ubuntu.com.my/
  3. Installing Ubuntu Linux will also install cron and wget.
  4. After you have installed Ubuntu Linux, log in to Ubuntu.
  5. From the Ubuntu desktop bar, click on Applications > Accessories > Terminal. This will open up a command-line terminal.
  6. In the Terminal, type :
    sudo apt-get install curl
    then press the [Enter] key (when prompted, enter your password) and this will automatically download and install the curl software for you. You can verify the installations using the commands shown below.

[Screenshot of the Ubuntu Terminal]
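To confirm that both downloaders are available, you can ask each one to print its version number. The exact version strings will depend on your installation :

    wget --version
    curl --version

If either command reports "command not found", repeat the sudo apt-get install step for that package.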
Setting up cron, wget and curl to download a web page
  1. Open a Terminal by clicking on the Ubuntu desktop bar Applications > Accessories > Terminal
  2. Add a cron entry by typing :
    crontab -e
    then press the [Enter] key.
  3. If this is your first time editing a cron entry, cron will ask which editor to use. I use Nano, which is the easiest default option.
  4. In the Nano editor, use the arrow keys to move to the first empty row, then type the command to download using wget :
    15 18 * * 1-5 wget http://www.bnm.gov.my/statistics/exchangerates.php
    This tells wget to download the web page exchangerates.php from the www.bnm.gov.my website at 15 minutes past the 18th hour (6.15pm), every month, on Mondays to Fridays (weekdays). The five fields before the command are the minute, hour, day of month, month and day of week; the complete crontab is shown after this list.
  5. Press the [Enter] key to move to a new row, then type the command to rename the downloaded file according to the date:
    18 18 * * 1-5 mv exchangerates.php exchangerates_wget_`date +\%Y\%m\%d`.php
    Be sure to use the backtick [`] key beside the number [1] key, not the single quote ['] key beside the [Enter] key. The % signs must be escaped with backslashes because % has a special meaning inside a crontab. This will rename the downloaded exchangerates.php file to exchangerates_wget_20110123.php (assuming the current date).
  6. Press the [Enter] key to move to a new row, then type the command to download using curl :
    20 18 * * 1-5 curl -O http://www.bnm.gov.my/statistics/exchangerates.php
    This tells curl to download and save the web page exchangerates.php from the www.bnm.gov.my website at 20 minutes past the 18th hour (6.20pm), every month, on Mondays to Fridays (weekdays). The -O option makes curl save the page under its remote filename instead of printing it to the screen.
  7. Press the [Enter] key to move to a new row, then type the command to rename the downloaded file according to the date:
    23 18 * * 1-5 mv exchangerates.php exchangerates_curl_`date +\%Y\%m\%d`.php
    As before, use the backtick [`] key beside the number [1] key, not the single quote ['] key beside the [Enter] key, and escape the % signs. This will rename the downloaded exchangerates.php file to exchangerates_curl_20110123.php (assuming the current date).
  8. Save the cron entries by pressing the [Ctrl] and [o] keys at the same time, then when prompted, press the [Enter] key.
  9. Exit the Nano editor by pressing the [Ctrl] and [x] keys at the same time.
  10. To exit the Terminal, type :
    exit
    then press the [Enter] key which will close the Terminal window
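For reference, here is a sketch of how the complete crontab could look once all four entries have been typed in. The comment line (starting with #) is optional and just labels the five scheduling fields; cron runs the commands from your home directory, which is where the downloaded files will land.

    # min hour day-of-month month day-of-week command
    15 18 * * 1-5 wget http://www.bnm.gov.my/statistics/exchangerates.php
    18 18 * * 1-5 mv exchangerates.php exchangerates_wget_`date +\%Y\%m\%d`.php
    20 18 * * 1-5 curl -O http://www.bnm.gov.my/statistics/exchangerates.php
    23 18 * * 1-5 mv exchangerates.php exchangerates_curl_`date +\%Y\%m\%d`.php

An alternative is to download straight into a dated filename in a single entry using wget's -O option, which avoids the separate mv step :

    15 18 * * 1-5 wget -O exchangerates_wget_`date +\%Y\%m\%d`.php http://www.bnm.gov.my/statistics/exchangerates.php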
Checking the automated scheduled downloads
  1. After 6.30pm, you can check whether the automated scheduled downloads have run.
  2. Open a Terminal by clicking on the Ubuntu desktop bar Applications > Accessories > Terminal
  3. Display a list of files by typing :
    ls
    then press the [Enter] key.
  4. From the list, you should be able to see the renamed downloaded files, similar to the example shown after this list.
  5. To exit the Terminal, type :
    exit
    then press the [Enter] key which will close the Terminal window
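As an illustration, if the cron jobs ran on 23 January 2011, the listing would include files like these (your filenames will carry the actual download dates) :

    exchangerates_curl_20110123.php
    exchangerates_wget_20110123.php

You can also type crontab -l in the Terminal to list your saved cron entries and confirm that they were stored correctly.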
In the tutorial above, you can see that wget and curl are doing the same thing. However, curl has an advantage on certain websites where a username and password are required, or where certain parameters must be submitted, such as a selection of date range and output format; a sketch follows below.
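As a rough sketch of what that looks like : the curl options below are real (-u supplies a username and password, -d submits form parameters), but the URL, the credentials and the parameter names are made up for illustration; you would substitute whatever the actual website expects.

    # -u supplies a username and password for sites that require a login
    # -d submits form parameters; these parameter names are hypothetical
    curl -u myusername:mypassword \
         -d "startdate=20110101" -d "enddate=20110123" -d "format=csv" \
         -O http://www.example.com/data/rates.php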

There are many more interesting websites with time-series data where it is possible to automate the task of collecting the data, such as finance.yahoo.com, finance.google.com, worldbank.org, etc.
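The pattern is the same as in this tutorial; only the URL changes. As a purely hypothetical example (example.com stands in for whichever site you choose, and you would need to look up that site's actual download address) :

    30 18 * * 1-5 wget -O mydata_`date +\%Y\%m\%d`.csv http://www.example.com/timeseries/data.csv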

In the next post, I will provide another short tutorial for processing the downloaded data so that it can be read in a spreadsheet program, which is also free and open source software.
