Last updated: 2004-06-11 23:00
NAME
todo.pl - Perl download manager for easier file slurping
DESCRIPTION
todo.pl takes a list of URIs and command directives from a TODO queue file and retrieves those files, resuming aborted downloads when possible.
It was originally created to enable easy slurping of image galleries where filenames were like pic1.jpg, pic2.jpg ... pic54.jpg, but has since been put to use grabbing all manner of resources from various websites.
OPTIONS
Except for -t, all switches are overridden by their corresponding todo file command directives. See the next section for information on command directives.
- -t, --todo
- The path to the todo file to slurp. Defaults to '_todo'.
- -p, --path
- Sets the download directory for all slurped files.
- --delay
- The delay, in seconds, between requests.
- --loopdelay
- Time, in seconds, to wait before retrying items that were skipped because of an error on the previous pass over the list. Setting this to zero disables loop-until-done behavior, so the script exits after one pass over the download queue.
- -s, --skip
- Skips to the next queued site when any error occurs. By default, URIs resulting in 404 errors are removed from the queue and the next item from the same site is fetched.
- -v, --verbose
- Controls the amount of feedback the script gives you. Each -v switch increases verbosity one level, though one should be plenty. The DEBUG command directive requires an integer from 0-3.
- -l, --logfile
- Path of file for transaction logging. Default is '_log'.
- --loglevel
- Controls the verbosity of the log file, if one has been specified. Accepts an integer from 0-3. Default is 0 (disabled).
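The precedence rule above (todo file directives override every switch except -t) can be sketched as follows. This is an illustrative Python model, not the script's own Perl code. The switch names and the defaults for --todo, --logfile and --loglevel come from this page; the `DIRECTIVE_TO_SETTING` table, the function names, and the choice to parse delays as floats are assumptions made for the example.

```python
import argparse


def build_parser():
    """Mirror the switches documented above (defaults only where stated)."""
    parser = argparse.ArgumentParser(prog="todo.pl")
    parser.add_argument("-t", "--todo", default="_todo")
    parser.add_argument("-p", "--path")
    parser.add_argument("--delay", type=float)
    parser.add_argument("--loopdelay", type=float)
    parser.add_argument("-s", "--skip", action="store_true")
    parser.add_argument("-v", "--verbose", action="count", default=0)
    parser.add_argument("-l", "--logfile", default="_log")
    parser.add_argument("--loglevel", type=int, choices=range(4), default=0)
    return parser


# Hypothetical mapping: directive name -> (settings key, value converter).
DIRECTIVE_TO_SETTING = {
    "PATH": ("path", str),
    "DELAY": ("delay", float),
    "LOOPDELAY": ("loopdelay", float),
    "SKIP": ("skip", lambda value: value == "1"),
    "DEBUG": ("verbose", int),
    "LOGFILE": ("logfile", str),
    "LOGLEVEL": ("loglevel", int),
}


def effective_settings(argv, directives):
    """Start from the command line, then let directives override it."""
    settings = vars(build_parser().parse_args(argv))
    for name, value in directives.items():
        key, convert = DIRECTIVE_TO_SETTING[name]
        settings[key] = convert(value)
    return settings
```

For instance, running with `-v -v --delay 2` against a todo file containing `DELAY 5` and `DEBUG 1` would, under this model, end up with a delay of 5 seconds and verbosity 1.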
TODO FILE FORMAT
The TODO file is simply a list of URLs to fetch, with some useful extensions.
Whitespace lines and comments (lines beginning with #) are ignored. Lines that are not command directives are treated as URIs to fetch.
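As a rough illustration of these rules, the Python sketch below sorts the lines of a TODO file into directives and URIs. It is not the script's own parser. The directive names are the ones documented in this section; the function name `parse_todo`, the tuple layout, the tolerance for leading whitespace, and the assumption that directive keywords are upper case and separated from their argument by whitespace are all choices made for the example.

```python
# Directive keywords documented in this section.
DIRECTIVES = {
    "PATH", "DELAY", "LOOPDELAY", "SKIP", "DEBUG", "LOGFILE",
    "LOGLEVEL", "PREFIX", "REFERER", "ROOT", "END",
}


def parse_todo(text):
    """Return a list of ("directive", name, argument) and ("uri", line) tuples."""
    items = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Blank lines and comments are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0]
        argument = parts[1] if len(parts) > 1 else ""
        if keyword == "END":
            # Nothing after END is processed.
            break
        if keyword in DIRECTIVES:
            items.append(("directive", keyword, argument))
        else:
            # Anything that is not a directive is treated as a URI.
            items.append(("uri", line))
    return items
```

A bare `PREFIX` line comes back with an empty argument, which matches the way an empty directive is used to clear a setting.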
- PATH /download/dir
- Sets download directory. Equivalent to the --path switch.
- DELAY <seconds>
- Equivalent to --delay.
- LOOPDELAY <seconds>
- Equivalent to --loopdelay.
- SKIP <1=enable, 0=disable>
- Equivalent to --skip.
- DEBUG <debug level>
- Equivalent to -v, --verbose, except that you must specify a verbosity level from 0-3.
- LOGFILE /path/to/log.txt
- Equivalent to --logfile.
- LOGLEVEL <log level>
- Equivalent to --loglevel.
- PREFIX foo bar
- Prepends ``foo bar'' to all subsequent filenames. An empty PREFIX directive clears the prefix.
- REFERER http://foobar.com/page.html
- Specifies a URI to send in the Referer header, which some sites check to try to stop remote linking. This header will be sent for all subsequent URIs until an empty REFERER directive is encountered. By default the current item's own URI is sent as referer.
-
Special referer types:
- REFERER HOST
- The hostname of the resource's web server
- REFERER PARENT
- The parent directory of the resource
- REFERER ROOT
- The base directory (as specified by the ROOT command directive); often the same as PARENT
- ROOT http://foo.bar.com/dir/
- This base is prepended to all subsequent relative URLs.
-
    ROOT http://foo.bar.com/dir/
    file1.ext
    file2.ext
-
is equivalent to
-
    http://foo.bar.com/dir/file1.ext
    http://foo.bar.com/dir/file2.ext
- END
- The file is not processed beyond this point.
- URL Expansion
- Bracketed expressions allow you to specify a large quantity of sequentially numbered (or lettered) files without cramping your fingers.
-
    http://foo.com/bob[1-2]/[001-003].jpg
-
becomes
-
    http://foo.com/bob1/001.jpg
    http://foo.com/bob1/002.jpg
    http://foo.com/bob1/003.jpg
    http://foo.com/bob2/001.jpg
    http://foo.com/bob2/002.jpg
    http://foo.com/bob2/003.jpg
-
A filename prefix may be specified in any or all brackets like this: [foo_:001-010], which will add ``foo_'' to each filename. Prefixes are additive and are appended to PREFIX if specified.
-
Additionally, because Perl can increment strings, bracketed expressions can contain letters or any other word (\w) characters. See the sample below for an example.
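The expansion described above can be modeled with a short Python sketch. It is an approximation written for illustration, not the script's Perl code: `magic_increment` imitates Perl's string increment for ASCII letters and digits only, the bracket pattern accepts only alphanumeric range endpoints, and the function names and the (url, prefix) return shape are assumptions. Bracket prefixes are simply concatenated in order.

```python
import itertools
import re

# [start-end] or [prefix:start-end]; endpoints limited to letters and digits here.
BRACKET = re.compile(r"\[(?:([^\[\]:]*):)?([A-Za-z0-9]+)-([A-Za-z0-9]+)\]")


def magic_increment(s):
    """Increment an alphanumeric string with carry, e.g. '009' -> '010', 'az' -> 'ba'."""
    chars = list(s)
    for i in range(len(chars) - 1, -1, -1):
        c = chars[i]
        if c == "9":
            chars[i] = "0"
        elif c == "z":
            chars[i] = "a"
        elif c == "Z":
            chars[i] = "A"
        else:
            chars[i] = chr(ord(c) + 1)
            return "".join(chars)
    # Every position wrapped around: grow the string by one character.
    lead = "1" if s[0].isdigit() else ("a" if s[0].islower() else "A")
    return lead + "".join(chars)


def string_range(start, end):
    """List values from start to end, stopping if the string outgrows end."""
    values = [start]
    current = start
    while current != end:
        current = magic_increment(current)
        if len(current) > len(end):
            break
        values.append(current)
    return values


def expand(url):
    """Return (expanded_url, combined_bracket_prefix) pairs for one queue line."""
    literals, choices = [], []
    pos = 0
    for match in BRACKET.finditer(url):
        literals.append(url[pos:match.start()])
        prefix = match.group(1) or ""
        choices.append([(prefix, v) for v in string_range(match.group(2), match.group(3))])
        pos = match.end()
    literals.append(url[pos:])

    results = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for (_, value), literal in zip(combo, literals[1:]):
            pieces.extend([value, literal])
        results.append(("".join(pieces), "".join(p for p, _ in combo)))
    return results
```

Calling `expand("http://foo.com/bob[1-2]/[001-003].jpg")` yields the six URLs listed above in the same order, each paired with an empty prefix. A line without brackets comes back unchanged as a single pair.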
EXAMPLE
Here's a sample TODO queue file.
PATH /downloads
PREFIX Photos-
http://domain1.com/dir[cool:04-05]/image_[beans:a-c].jpg
ROOT http://domain2.com/files/
document1.doc
PREFIX Shazam-
document2.doc
PREFIX
http://domain3.com/ford.mpg
http://domain4.com/arthur.mpg
The above queue will fetch the following URIs--
http://domain1.com/dir04/image_a.jpg
http://domain1.com/dir04/image_b.jpg
http://domain1.com/dir04/image_c.jpg
http://domain1.com/dir05/image_a.jpg
http://domain1.com/dir05/image_b.jpg
http://domain1.com/dir05/image_c.jpg
http://domain2.com/files/document1.doc
http://domain2.com/files/document2.doc
http://domain3.com/ford.mpg
http://domain4.com/arthur.mpg
--which will be saved as--
/downloads/Photos-cool_beans_image_a.jpg
/downloads/Photos-cool_beans_image_b.jpg
/downloads/Photos-cool_beans_image_c.jpg
/downloads/Photos-cool_beans_image_a.jpg
/downloads/Photos-cool_beans_image_b.jpg
/downloads/Photos-cool_beans_image_c.jpg
/downloads/Photos-document1.doc
/downloads/Shazam-document2.doc
/downloads/ford.mpg
/downloads/arthur.mpg
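How a saved name is put together from the download directory, the current PREFIX, any bracket prefixes, and the URI's own filename can be sketched like this. It is a Python illustration under stated assumptions, not the script's code: `saved_path` and its parameters are hypothetical names, and the pieces are joined by plain concatenation in the order PREFIX, bracket prefixes, filename.

```python
import posixpath
from urllib.parse import urlsplit


def saved_path(download_dir, url, prefix="", bracket_prefixes=()):
    """Join the download directory with PREFIX + bracket prefixes + filename."""
    # The filename is the last path segment of the URI.
    filename = posixpath.basename(urlsplit(url).path)
    return posixpath.join(download_dir, prefix + "".join(bracket_prefixes) + filename)
```

Under these assumptions, `saved_path("/downloads", "http://domain2.com/files/document2.doc", prefix="Shazam-")` gives `/downloads/Shazam-document2.doc`, and a URI fetched with no prefix in effect, such as `http://domain3.com/ford.mpg`, is saved as `/downloads/ford.mpg`.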
BUGS
The LWP library does a poor job of resuming FTP transfers, so interrupted downloads over that protocol are restarted from scratch rather than resumed the way HTTP transfers are.
The new single-line download status display may break on some platforms. I've only tested it on win32.
LICENSE
Copyright 2000-2004 by Coke Harrington <coke@cokesque.com>
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA