contao:crawl

This feature is available in Contao 4.9 and later.

As of Contao 4.9, Contao ships with an HTTP crawler built on top of Escargot. Like any other crawler, it requests the URLs generated by Contao: it follows links that are part of the sitemap.xml and respects robots.txt information, the rel attribute on links and more. While it does so, any number of so-called “subscribers” can subscribe to the results of the HTTP requests and process them further in any way they want. Contao currently ships with two subscribers:

  • search-index - updates the built-in search index (only available if search is enabled)
  • broken-link-checker - checks all the pages for broken links

Extensions may provide additional subscribers, so this list is not necessarily exhaustive. Use the command as follows:

php vendor/bin/contao-console contao:crawl [options] [<job>]

The command takes a single, optional argument, job, which represents a job ID. Crawling can take a very long time, so if you want to stop and pick up later where you left off, note the job ID you were assigned when you first ran the command and pass it to resume that job. However, since the CLI usually has no memory or runtime limits, this will likely not be needed very often.
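As a minimal sketch of stopping and resuming, the two helper functions below wrap the command shown above. The function names and the request limit of 500 are illustrative assumptions, not part of Contao; the --max-requests option is described in the options table further down, and the job ID is whatever the first run reported.

```shell
# Hypothetical helper: start a crawl that stops after a fixed number of requests.
# The limit of 500 is an arbitrary example value.
crawl_start() {
  php vendor/bin/contao-console contao:crawl --max-requests=500
}

# Hypothetical helper: resume an earlier crawl.
# $1 is the job ID reported by the first run.
crawl_resume() {
  php vendor/bin/contao-console contao:crawl "$1"
}
```

Run crawl_start once, note the job ID it reports, then call crawl_resume with that ID as its only argument.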

The options are far more important so let’s get to them right away:

Option | Description
--subscribers (-s) | By default, all subscribers are enabled, but you can pass a comma-separated list of the subscribers you want to enable. For example, to only check for broken links, pass broken-link-checker.
--concurrency (-c) | Sets the number of concurrent requests. The higher the value, the faster the crawl completes, but web servers can only handle a certain number of concurrent requests, so choose the value wisely.
--delay | To avoid hammering a web server, you can also configure a delay. The value is in microseconds and causes the crawler to wait that many microseconds between requests.
--max-requests | By default, no limit is configured. Use this option to execute only a certain number of requests in total. To pick up the job later, see the job argument.
--max-depth | The maximum tree depth the crawler searches. The root page is level 1, all links found there are level 2, all links found on level 2 are level 3, and so on. By default, no maximum depth is configured. The higher the number, the deeper the crawler searches, but the longer the crawl takes.
--enable-debug-csv | By default, the subscriber results are written to the standard output. Pass this option to have the command write everything to a CSV file.
--debug-csv-path | Overrides the default CSV file path when --enable-debug-csv is used.
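The options can be combined. The sketch below shows two possible invocations wrapped in shell functions; the function names, the numeric values and the CSV path are illustrative assumptions, and only the option names come from the table above.

```shell
# Hypothetical helper: check for broken links only, with a gentle load on the server.
# All numeric values are arbitrary examples, not recommendations.
# --delay is given in microseconds, so 500000 is half a second between requests.
crawl_broken_links() {
  php vendor/bin/contao-console contao:crawl \
    --subscribers=broken-link-checker \
    --concurrency=5 \
    --delay=500000 \
    --max-depth=3
}

# Hypothetical helper: same subscriber, with results written to a CSV file.
# $1 is a CSV path chosen by the caller.
crawl_broken_links_csv() {
  php vendor/bin/contao-console contao:crawl \
    --subscribers=broken-link-checker \
    --enable-debug-csv \
    --debug-csv-path="$1"
}
```

For example, crawl_broken_links_csv /tmp/crawl.csv would pass that path to --debug-csv-path. Lower concurrency and a longer delay trade crawl speed for less load on the web server.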

Notes

  • Make sure the correct domain is defined, either in your website root or as a default domain via the router.request_context.host parameter.
  • Protected pages can currently only be indexed via the back end. Indexing them has to be enabled in config.yaml.
  • If a website is protected with “Basic Authentication” in the production or staging environment before publication, access for the crawler can be configured in config.yaml.