This feature is available in Contao 4.9 and later.
As of Contao 4.9, Contao is equipped with an HTTP crawler. Internally, it is built
on top of Escargot.
The crawler essentially crawls all the URLs generated by Contao like any other crawler. It follows links that are part of the `sitemap.xml`, respects `robots.txt` information, the `rel` attribute on links and much more.
While doing this, any number of so-called “subscribers” can subscribe to the results of the HTTP requests and process them further in any way they want. As of today, Contao ships with two subscribers:

* `search-index` - updates the built-in search index (only available if searching is enabled)
* `broken-link-checker` - checks all pages for broken links

Any extension may provide additional subscribers, so this list is not necessarily exhaustive. Use the command as follows:
```bash
php vendor/bin/contao-console contao:crawl [options] [<job>]
```
The command takes only one argument, `job`. It is optional and represents a job ID. Crawling can be a very long-running task, so if you want to stop and pick up where you left off later, you need to remember the job ID you were assigned when you first ran the command; you can then resume with that job ID later on, as in the sketch below. However, since there are usually no memory or runtime limits on the CLI, this will likely not be needed very often.
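For example, a first run assigns and prints a job ID, which can be passed again to resume the crawl. The ID shown here is only a placeholder:

```bash
# Start a new crawl; the command reports the job ID it was assigned
php vendor/bin/contao-console contao:crawl

# Resume that job later by passing the ID as the argument (placeholder value)
php vendor/bin/contao-console contao:crawl 1a2b3c4d
```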
The options are far more important, so let’s get to them right away:
| Option | Description |
|---|---|
| `--subscribers` (`-s`) | By default, all subscribers are enabled, but you can pass a comma-separated list of the subscribers you want to enable. E.g. if you only want to check for broken links, pass only `broken-link-checker` here. |
| `--concurrency` (`-c`) | Allows you to configure the number of concurrent requests. The higher the value, the faster the process completes, but web servers can only handle a certain number of concurrent requests, so choose the value wisely. |
| `--delay` | To make sure you are not hammering a web server, you can also configure a delay. It is specified in microseconds and causes the crawler to wait n microseconds between requests. |
| `--max-requests` | By default, there is no limit configured, but if you want to execute only a certain number of requests in total, you can do so with this option. If you want to pick up the job later, see the `job` argument. |
| `--max-depth` | The tree depth the crawler is going to search. The root page is level 1 and all links found there are level 2; all links found on level 2 are level 3, and so on. By default, no max depth is configured. The higher the number, the deeper the crawler searches, but the longer it takes. |
| `--enable-debug-csv` | By default, the subscriber results are written to standard output. You can ask the command to write everything to a CSV file by passing this option. |
| `--debug-csv-path` | Allows you to override the default CSV file path if you used `--enable-debug-csv`. |
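As a sketch, a run that only checks for broken links, uses ten concurrent requests with a small delay and writes a CSV report might look like this (all values are just examples):

```bash
php vendor/bin/contao-console contao:crawl \
    --subscribers=broken-link-checker \
    --concurrency=10 \
    --delay=1000 \
    --enable-debug-csv
```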
Notes

Because the command runs on the command line, Contao cannot derive your website’s domain from an incoming request. Make sure a domain name is configured in your website root settings, or set the `router.request_context.host` parameter, so the crawler can generate absolute URLs.
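If you need to set the request context explicitly, a minimal sketch, assuming a standard configuration file location such as `config/config.yaml` and an example domain, could look like this:

```yaml
# config/config.yaml (location and values assumed; adjust to your setup)
parameters:
    router.request_context.host: 'example.com'
    router.request_context.scheme: 'https'
```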