Manual:Pywikibot/weblinkchecker.py

Overview

weblinkchecker.py is a Pywikibot script that finds broken external links.

weblinkchecker.py can check the following:

  • All URLs found on a single article
  • All articles in a category
  • All articles in one or more namespaces
  • All articles on the wiki
  • And much more! Check the list of command-line arguments.
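
For example, each of these modes corresponds to a flag described under "Command-line arguments" below. A rough sketch (the page and category names are placeholders, and exact flag combinations should be checked against the bot's help):

python weblinkchecker.py -page:Example_page
python weblinkchecker.py -cat:Example_category
python weblinkchecker.py -start:! -ns:0,2,4
python weblinkchecker.py -start:!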

It will only check HTTP and HTTPS links, and it will leave out URLs inside comments and nowiki tags. To speed itself up, it will check up to 50 links at the same time, using multithreading.

The bot will not remove external links by itself; it only reports them, since removal would require strong artificial intelligence. It only reports a dead link if it has been found unresponsive at least twice, with a default waiting period of at least one week between the first and the last check. This should help prevent links from being removed because of a temporary server failure. Keep in mind that the bot cannot differentiate between local failures and server failures, so make sure you are on a stable Internet connection.

The bot saves a history of unavailable links to a .dat file in the deadlinks subdirectory, e.g. ./deadlinks/deadlinks-wikipedia-de.dat. This file is not intended to be read or modified by humans. The .dat file is written when the bot terminates (because it is done or because the user pressed CTRL-C). After a second run (with an appropriate wait between the two), a human-readable list of broken links is written to a .txt file.

Usage

To check for dead links for the first time for all pages on the wiki:

python weblinkchecker.py -start:!

This will add entries to the .dat file, together with a date. If you run this command again, it will add any new dead links that are not already listed and remove existing entries for links that are working again.

After the bot has checked some pages, run it again on those pages at a later time with this command:

python weblinkchecker.py -repeat

If the bot finds a broken link that has been broken for at least one week, it logs it in a text file, e.g. deadlinks/results-wikipedia-de.txt. The text has a format suitable for posting on the wiki, so that others can help you fix or remove the broken links from the wiki pages.

Additionally, it's possible to report broken links to the talk page of the article in which the URL was found (again, only once the linked page has been unavailable at least twice in at least one week). To use this feature, set report_dead_links_on_talk = True in your user-config.py.
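
For example, the relevant line in user-config.py is just the assignment mentioned above:

report_dead_links_on_talk = True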

Reports will include a link to the Internet Archive Wayback Machine if available, so that important references can be kept.

Syntax examples

python weblinkchecker.py -start:!
    Loads all wiki pages in alphabetical order using the Special:Allpages feature.
python weblinkchecker.py -start:Example_page
    Loads all wiki pages using the Special:Allpages feature, starting at "Example page".
python weblinkchecker.py -weblink:www.example.org
    Loads all wiki pages that link to www.example.org.
python weblinkchecker.py Example page
python weblinkchecker.py -page:Example page
    Only checks links found in the wiki page "Example page".
python weblinkchecker.py -repeat
    Loads all wiki pages where dead links were found during a prior run.

Command-line arguments

The following list was extracted from the bot's built-in help (python weblinkchecker.py -help). These arguments are available in addition to the global arguments used by most bots. Most of them have not been independently verified.

Parameter Explanation
-cat Work on all pages which are in a specific category. Argument can also be given as "-cat:categoryname" or as "-cat:categoryname|fromtitle" (using # instead of | is also allowed in this one and the following)
-catr Like -cat, but also recursively includes pages in subcategories, sub-subcategories etc. of the given category. Argument can also be given as "-catr:categoryname" or as "-catr:categoryname|fromtitle".
-subcats Work on all subcategories of a specific category. Argument can also be given as "-subcats:categoryname" or as "-subcats:categoryname|fromtitle".
-subcatsr Like -subcats, but also includes sub-subcategories etc. of the given category. Argument can also be given as "-subcatsr:categoryname" or as "-subcatsr:categoryname|fromtitle".
-uncat Work on all pages which are not categorised.
-uncatcat Work on all categories which are not categorised.
-uncatfiles Work on all files which are not categorised.
-uncattemplates Work on all templates which are not categorised.
-file Read a list of pages to treat from the named text file. Page titles in the file must be enclosed with [[brackets]] or separated by newlines. Argument can also be given as "-file:filename".
-filelinks Work on all pages that use a certain image/media file. Argument can also be given as "-filelinks:filename".
-search Work on all pages that are found in a MediaWiki search across all namespaces.
-namespace
-ns
Filter the page generator to only yield pages in the specified namespaces. Separate multiple namespace numbers with commas. Example "-ns:0,2,4"
-interwiki Work on the given page and all equivalent pages in other languages. This can, for example, be used to fight multi-site spamming. Attention: this will cause the bot to modify pages on several wiki sites; this is not well tested, so check your edits!
-limit:n When used with any other argument that specifies a set of pages, work on no more than n pages in total
-links Work on all pages that are linked from a certain page. Argument can also be given as "-links:linkingpagetitle".
-imagelinks Work on all images that are linked from a certain page. Argument can also be given as "-imagelinks:linkingpagetitle".
-newimages Work on the 100 newest images. If given as -newimages:x, will work on the x newest images.
-new Work on the 60 newest pages. If given as -new:x, will work on the x newest pages.
-recentchanges Work on new and edited pages returned by Special:Recentchanges. Can also be given as "-recentchanges:n" where n is the number of pages to be returned, else 100 pages are returned.
-ref Work on all pages that link to a certain page. Argument can also be given as "-ref:referredpagetitle".
-start Specifies that the robot should go alphabetically through all pages on the home wiki, starting at the named page. Argument can also be given as "-start:pagetitle". You can also include a namespace. For example, "-start:Template:!" will make the bot work on all pages in the template namespace.
-prefixindex Work on pages commencing with a common prefix.
-titleregex Work on titles that match the given regular expression.
-transcludes Work on all pages that use a certain template. Argument can also be given as "-transcludes:Title".
-unusedfiles Work on all description pages of images/media files that are not used anywhere. Argument can be given as "-unusedfiles:n" where n is the maximum number of articles to work on.
-unwatched Work on all articles that are not watched by anyone. Argument can be given as "-unwatched:n" where n is the maximum number of articles to work on.
-usercontribs Work on articles that were edited by a certain user. Example: -usercontribs:DumZiBoT. Normally up to 250 distinct pages are given. To get a different number of pages, append the number to the username, delimited with ";". Example: -usercontribs:DumZiBoT;500 returns 500 distinct pages to work on.
-<mode>log Work on articles that appear in a specified Special:Log. The <mode> parameter selects the log type and can be one of the following:

block, protect, rights, delete, upload, move, import, patrol, merge, suppress, review, stable, gblblock, renameuser, globalauth, gblrights, abusefilter, newusers. Examples:

  • -movelog gives 500 pages from the move log (these should be redirects)
  • -deletelog:10 gives 10 pages from the deletion log
  • -protect:Dummy gives 500 pages from the protection log by user Dummy
  • -patrol:Dummy;20 gives 20 pages patrolled by user Dummy (in some cases this must be written as -patrol:"Dummy;20")
-weblink Work on all articles that contain an external link to a given URL; may be given as "-weblink:url"
-withoutinterwiki Work on all pages that don't have interlanguage links. Argument can be given as "-withoutinterwiki:n" where n is some number (??).
-random Work on random pages returned by Special:Random. Can also be given as "-random:n" where n is the number of pages to be returned, else 10 pages are returned.
-randomredirect Work on random redirect target pages returned by Special:Randomredirect. Can also be given as "-randomredirect:n" where n is the number of pages to be returned, else 10 pages are returned.
-gorandom Specifies that the robot should start at the random pages returned by Special:Random.
-redirectonly Work on redirect pages only, not their target pages. The robot goes alphabetically through all redirect pages on the wiki, starting at the named page. The argument can also be given as "-redirectonly:pagetitle". You can also include a namespace. For example, "-redirectonly:Template:!" will make the bot work on all redirect pages in the template namespace.
-google Work on all pages that are found in a Google search. You need a Google Web API license key. Note that Google doesn't give out license keys anymore. See google_key in config.py for instructions. Argument can also be given as "-google:searchstring".
-yahoo Work on all pages that are found in a Yahoo search. Depends on python module pYsearch. See yahoo_appid in config.py for instructions.
-page Work on a single page. Argument can also be given as "-page:pagetitle".
-repeat Work on all pages where dead links were found before. This is useful to confirm that the links are still dead after some time (at least one week), which is required before the script will report the problem.
-namespace Only process pages in the namespace with the given number or name. This parameter may be used multiple times.
-ignore HTTP return codes to ignore. Can be provided several times: -ignore:401 -ignore:500
Furthermore, the following command line arguments are supported:
-talk Overrides the report_dead_links_on_talk config variable, enabling the feature.
-notalk Overrides the report_dead_links_on_talk config variable, disabling the feature.
-day Only report links whose first recorded failure was more than x days ago (such links should probably be fixed or removed). If not set, the default is 7 days.

All other arguments will be regarded as part of the title of a single page, and the bot will only work on that single page.
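
As a sketch, several of the options above can be combined on one command line (whether these particular options interact exactly as expected has not been verified here):

python weblinkchecker.py -start:! -ignore:401 -ignore:500 -day:14 -notalk

This would scan the whole wiki, ignore HTTP 401 and 500 responses, only report links first found dead more than 14 days ago, and suppress talk-page reports.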

Configuration variables

The following config variables (to be declared in user-config.py) are supported by this script:

Parameter Explanation
max_external_links The maximum number of web pages that should be loaded simultaneously. You should change this according to your Internet connection speed. Be careful: if it is set too high, the script might get socket errors because your network is congested, and will then think that the page is offline.
report_dead_links_on_talk If set to true, causes the script to report dead links on the article's talk page if (and ONLY if) the linked page has been unavailable at least two times during a timespan of at least one week.
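
Putting these together, a user-config.py fragment might look like the following sketch (the value 50 mirrors the concurrency mentioned in the overview; lower it on a slow connection):

# in user-config.py
max_external_links = 50           # how many web pages to load simultaneously; lower this on a slow or congested connection
report_dead_links_on_talk = True  # report dead links on the article's talk page once they have failed twice over at least a week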
