The DomCrawler Component¶

The DomCrawler component eases DOM navigation for HTML and XML documents.

Note

While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.

Installation¶

$ composer require symfony/dom-crawler

Note

If you install this component outside of a Symfony application, you must require the vendor/autoload.php file in your code to enable the class autoloading mechanism provided by Composer. Read the "How to Install and Use the Symfony Components" article for more details.

Usage¶

Node Filtering¶

Using XPath expressions, you can select specific nodes within the document:

$crawler = $crawler->filterXPath('descendant-or-self::body/p');

Tip

DOMXPath::query is used internally to actually perform an XPath query.

If you prefer CSS selectors over XPath, install the CssSelector component. It allows you to use jQuery-like selectors to traverse the document:

$crawler = $crawler->filter('body > p');

An anonymous function can be used to filter with more complex criteria:

use Symfony\Component\DomCrawler\Crawler;
// ...

$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i) {
        // filters every other node
        return ($i % 2) == 0;
    });

To remove a node, the anonymous function must return false.

Note

All filter methods return a new Crawler instance with filtered content.
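As a minimal sketch of the filtering methods above, the following script builds a crawler from an inline HTML string and shows that filterXPath() and reduce() leave the crawler they are called on untouched. The markup and the relative vendor/autoload.php path are assumptions made for illustration. Only XPath is used, so the CssSelector component is not needed here.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// illustrative markup, not taken from a real page
$html = '<html><body><p>One</p><p>Two</p><p>Three</p><div><p>Nested</p></div></body></html>';
$crawler = new Crawler($html);

// selects only the <p> elements that are direct children of <body>
$paragraphs = $crawler->filterXPath('descendant-or-self::body/p');

// keeps the nodes at even positions (0 and 2)
$everyOther = $paragraphs->reduce(function (Crawler $node, $i) {
    return 0 === $i % 2;
});

echo count($crawler), "\n";    // 1: still only the root <html> node
echo count($paragraphs), "\n"; // 3
echo count($everyOther), "\n"; // 2
```

The nested paragraph is skipped because the expression only matches direct children of body.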

Both the filterXPath() and filter() methods work with XML namespaces, which can be either automatically discovered or registered explicitly.

Consider the XML below:

<?xml version="1.0" encoding="UTF-8"?>
<entry
    xmlns="http://www.w3.org/2005/Atom"
    xmlns:media="http://search.yahoo.com/mrss/"
    xmlns:yt="http://gdata.youtube.com/schemas/2007"
>
    <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
    <yt:accessControl action="comment" permission="allowed"/>
    <yt:accessControl action="videoRespond" permission="moderated"/>
    <media:group>
        <media:title type="plain">Chordates - CrashCourse Biology #24</media:title>
        <yt:aspectRatio>widescreen</yt:aspectRatio>
    </media:group>
</entry>

This document can be filtered with the Crawler, without registering any namespace aliases, both with filterXPath():

$crawler = $crawler->filterXPath('//default:entry/media:group//yt:aspectRatio');

and filter():

$crawler = $crawler->filter('default|entry media|group yt|aspectRatio');

Note

The default namespace is registered with a prefix “default”. It can be changed with the setDefaultNamespacePrefix() method.

The default namespace is removed when loading the content if it’s the only namespace in the document. It’s done to simplify the XPath queries.

Namespaces can be explicitly registered with the registerNamespace() method:

$crawler->registerNamespace('m', 'http://search.yahoo.com/mrss/');
$crawler = $crawler->filterXPath('//m:group//yt:aspectRatio');
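The two approaches can be compared in a short script. This is a sketch that assumes a trimmed copy of the XML above is loaded with addXmlContent(); the variable names and the autoload path are illustrative.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// shortened version of the Atom entry shown earlier
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<entry
    xmlns="http://www.w3.org/2005/Atom"
    xmlns:media="http://search.yahoo.com/mrss/"
    xmlns:yt="http://gdata.youtube.com/schemas/2007"
>
    <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
    <media:group>
        <media:title type="plain">Chordates - CrashCourse Biology #24</media:title>
        <yt:aspectRatio>widescreen</yt:aspectRatio>
    </media:group>
</entry>
XML;

$crawler = new Crawler();
$crawler->addXmlContent($xml);

// prefixes are discovered from the document; "default" is the default namespace
$auto = $crawler->filterXPath('//default:entry/media:group//yt:aspectRatio')->text();

// "m" is an alias registered by hand for the media namespace
$crawler->registerNamespace('m', 'http://search.yahoo.com/mrss/');
$manual = $crawler->filterXPath('//m:group//yt:aspectRatio')->text();

echo $auto, "\n";
echo $manual, "\n";
```

Both queries should select the same yt:aspectRatio element, so both lines are expected to print widescreen.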

Verify if the current node matches a selector:

$crawler->matches('p.lorem');

Node Traversing¶

Access a node by its position in the list:

$crawler->filter('body > p')->eq(0);

Get the first or last node of the current selection:

$crawler->filter('body > p')->first();
$crawler->filter('body > p')->last();

Get the nodes of the same level as the current selection:

$crawler->filter('body > p')->siblings();

Get the same level nodes after or before the current selection:

$crawler->filter('body > p')->nextAll();
$crawler->filter('body > p')->previousAll();

Get all the child or parent nodes:

$crawler->filter('body')->children();
$crawler->filter('body > p')->parents();

Get all the direct child nodes matching a CSS selector:

$crawler->filter('body')->children('p.lorem');

Get the first parent (heading toward the document root) of the element that matches the provided selector:

$crawler->closest('p.lorem');

Note

All the traversal methods return a new Crawler instance.
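The traversal methods can be tried on a small document. This is a sketch on made-up markup; it uses filterXPath() rather than filter() so that it runs without the CssSelector component.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// illustrative markup with three sibling paragraphs
$crawler = new Crawler('<html><body><p>A</p><p>B</p><p>C</p></body></html>');
$paragraphs = $crawler->filterXPath('//body/p');

echo $paragraphs->eq(1)->text(), "\n";                          // B
echo $paragraphs->first()->text(), "\n";                        // A
echo $paragraphs->last()->text(), "\n";                         // C
echo count($paragraphs->first()->nextAll()), "\n";              // 2 (B and C)
echo count($paragraphs->last()->previousAll()), "\n";           // 2 (B and A)
echo count($paragraphs->eq(1)->siblings()), "\n";               // 2 (A and C)
echo count($crawler->filterXPath('//body')->children()), "\n";  // 3

// every call above returned a new Crawler; the selection itself is unchanged
echo count($paragraphs), "\n";                                  // 3
```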

Accessing Node Values¶

Access the node name (HTML tag name) of the first node of the current selection (e.g. “p” or “div”):

// returns the node name (HTML tag name) of the first child element under <body>
$tag = $crawler->filterXPath('//body/*')->nodeName();

Access the value of the first node of the current selection:

// if the node does not exist, calling text() throws an exception
$message = $crawler->filterXPath('//body/p')->text();

// to avoid the exception, pass a default value that text() returns when the node does not exist
$message = $crawler->filterXPath('//body/p')->text('Default text content');

// by default, text() trims whitespace, including internal runs of it
// (e.g. "  foo\n  bar    baz \n " is returned as "foo bar baz");
// pass false as the second argument to return the original text unchanged
$crawler->filterXPath('//body/p')->text('Default text content', false);

Access the attribute value of the first node of the current selection:

$class = $crawler->filterXPath('//body/p')->attr('class');

Extract attribute and/or node values from the list of nodes:

$attributes = $crawler
    ->filterXPath('//body/p')
    ->extract(['_name', '_text', 'class'])
;

Note

Special attribute _text represents a node value, while _name represents the element name (the HTML tag name).
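The accessors above can be compared side by side. This is a sketch on made-up markup; whitespace normalization is requested explicitly through the second argument of text() so that the result does not depend on the default of the installed version.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// illustrative markup; the first paragraph contains extra whitespace on purpose
$html = "<html><body><p class=\"intro\">  Hello\n    world </p><p class=\"outro\">Bye</p></body></html>";
$crawler = new Crawler($html);
$paragraphs = $crawler->filterXPath('//body/p');

// these three calls only look at the first node of the selection
$name = $paragraphs->nodeName();        // p
$class = $paragraphs->attr('class');    // intro
$text = $paragraphs->text(null, true);  // Hello world

// no <h1> exists, so the default value is returned instead of an exception
$heading = $crawler->filterXPath('//body/h1')->text('No heading');

// extract() looks at every node of the selection
$classes = $paragraphs->extract(['class']);        // ['intro', 'outro']
$pairs = $paragraphs->extract(['_name', 'class']); // [['p', 'intro'], ['p', 'outro']]

var_dump($name, $class, $text, $heading, $classes, $pairs);
```

Note that extract() returns a flat array when a single attribute is requested and an array of arrays otherwise.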

Call an anonymous function on each node of the list:

use Symfony\Component\DomCrawler\Crawler;
// ...

$nodeValues = $crawler->filter('p')->each(function (Crawler $node, $i) {
    return $node->text();
});

The anonymous function receives the node (as a Crawler) and the position as arguments. The result is an array of values returned by the anonymous function calls.

When using nested crawlers, beware that filterXPath() is evaluated in the context of the crawler it is called on:

$crawler->filterXPath('parent')->each(function (Crawler $parentCrawler, $i) {
    // DON'T DO THIS: the direct child cannot be found
    $subCrawler = $parentCrawler->filterXPath('sub-tag/sub-child-tag');

    // DO THIS: specify the parent tag too
    $subCrawler = $parentCrawler->filterXPath('parent/sub-tag/sub-child-tag');
    $subCrawler = $parentCrawler->filterXPath('node()/sub-tag/sub-child-tag');
});

Adding the Content¶

The crawler supports multiple ways of adding the content:

$crawler = new Crawler('<html><body/></html>');

$crawler->addHtmlContent('<html><body/></html>');
$crawler->addXmlContent('<root><node/></root>');

$crawler->addContent('<html><body/></html>');
$crawler->addContent('<root><node/></root>', 'text/xml');

$crawler->add('<html><body/></html>');
$crawler->add('<root><node/></root>');

Note

The addHtmlContent() and addXmlContent() methods default to UTF-8 encoding but you can change this behavior with their second optional argument.

The addContent() method guesses the best charset according to the given contents and defaults to ISO-8859-1 in case no charset can be guessed.

As the Crawler’s implementation is based on the DOM extension, it is also able to interact with native DOMDocument, DOMNodeList and DOMNode objects:

$domDocument = new \DOMDocument();
$domDocument->loadXML('<root><node/><node/></root>');
$nodeList = $domDocument->getElementsByTagName('node');
$node = $domDocument->getElementsByTagName('node')->item(0);

$crawler->addDocument($domDocument);
$crawler->addNodeList($nodeList);
$crawler->addNodes([$node]);
$crawler->addNode($node);
$crawler->add($domDocument);

Manipulating and Dumping a Crawler

These methods on the Crawler are intended to initially populate your Crawler and aren’t intended to be used to further manipulate a DOM (though this is possible). However, since the Crawler is a set of DOMElement objects, you can use any method or property available on DOMElement, DOMNode or DOMDocument. For example, you could get the HTML of a Crawler with something like this:

$html = '';

foreach ($crawler as $domElement) {
    $html .= $domElement->ownerDocument->saveHTML($domElement);
}

Or you can get the HTML of the first node using html():

// if the node does not exist, calling html() throws an exception
$html = $crawler->html();

// to avoid the exception, pass a default value that html() returns when the node does not exist
$html = $crawler->html('Default <strong>HTML</strong> content');

Or you can get the outer HTML of the first node using outerHtml():

$html = $crawler->outerHtml();
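The difference between html() and outerHtml() is easiest to see on a concrete node. This is a sketch on made-up markup, parsed with the default libxml-based HTML loader.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// illustrative markup
$crawler = new Crawler('<html><body><div id="box"><b>hi</b> there</div></body></html>');
$box = $crawler->filterXPath('//div[@id="box"]');

// html() serializes only the children of the first node
$inner = $box->html();

// outerHtml() also includes the node itself
$outer = $box->outerHtml();

// no <section> exists, so the default value is returned
$fallback = $crawler->filterXPath('//section')->html('<em>empty</em>');

echo $inner, "\n";    // <b>hi</b> there
echo $outer, "\n";    // <div id="box"><b>hi</b> there</div>
echo $fallback, "\n"; // <em>empty</em>
```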

Expression Evaluation¶

The evaluate() method evaluates the given XPath expression. The return value depends on the expression. If it evaluates to a scalar value (e.g. an HTML attribute value), an array of results is returned, one per node in the current selection. If it evaluates to a list of DOM nodes, a new Crawler instance is returned.

This behavior is best illustrated with examples:

use Symfony\Component\DomCrawler\Crawler;

$html = '<html>
<body>
    <span id="article-100" class="article">Article 1</span>
    <span id="article-101" class="article">Article 2</span>
    <span id="article-102" class="article">Article 3</span>
</body>
</html>';

$crawler = new Crawler();
$crawler->addHtmlContent($html);

$crawler->filterXPath('//span[contains(@id, "article-")]')->evaluate('substring-after(@id, "-")');
/* Result:
[
    0 => '100',
    1 => '101',
    2 => '102',
]
*/

$crawler->evaluate('substring-after(//span[contains(@id, "article-")]/@id, "-")');
/* Result:
[
    0 => '100',
]
*/

$crawler->filterXPath('//span[@class="article"]')->evaluate('count(@id)');
/* Result:
[
    0 => 1.0,
    1 => 1.0,
    2 => 1.0,
]
*/

$crawler->evaluate('count(//span[@class="article"])');
/* Result:
[
    0 => 3.0,
]
*/

$crawler->evaluate('//span[1]');
// A Symfony\Component\DomCrawler\Crawler instance

Links¶

Use the filter() method to find links by their id or class attributes and use the selectLink() method to find links by their content (it also finds clickable images with that content in their alt attribute).

Both methods return a Crawler instance with just the selected link. Use the link() method to get the Link object that represents the link:

// first, select the link by id, class or content...
$linkCrawler = $crawler->filter('#sign-up');
$linkCrawler = $crawler->filter('.user-profile');
$linkCrawler = $crawler->selectLink('Log in');

// ...then, get the Link object:
$link = $linkCrawler->link();

// or do all this at once:
$link = $crawler->filter('#sign-up')->link();
$link = $crawler->filter('.user-profile')->link();
$link = $crawler->selectLink('Log in')->link();

The Link object has several useful methods to get more information about the selected link itself:

// returns the proper URI that can be used to make another request
$uri = $link->getUri();

Note

The getUri() method is especially useful as it cleans the href value and transforms it into how it should really be processed. For example, for a link with href="#foo", this would return the full URI of the current page suffixed with #foo. The return value of getUri() is always a full URI that you can act on.
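The following sketch shows how href values are resolved. The crawler needs to know the URI of the page it represents, which is passed here as the second constructor argument; the example.com address and the markup are hypothetical.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// illustrative markup with an absolute-path link and a fragment-only link
$html = '<html><body><a href="/register">Sign up</a> <a href="#foo">Jump</a></body></html>';

// hypothetical URI of the page the markup came from
$crawler = new Crawler($html, 'https://example.com/blog/post');

$signUp = $crawler->selectLink('Sign up')->link()->getUri();
$jump = $crawler->selectLink('Jump')->link()->getUri();

echo $signUp, "\n"; // https://example.com/register
echo $jump, "\n";   // https://example.com/blog/post#foo
```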

Images¶

To find an image by its alt attribute, use the selectImage() method on an existing crawler. This returns a Crawler instance with just the selected image(s). Calling image() gives you a special Image object:

$imagesCrawler = $crawler->selectImage('Kitten');
$image = $imagesCrawler->image();

// or do this all at once
$image = $crawler->selectImage('Kitten')->image();

The Image object has the same getUri() method as Link.

Forms¶

Special treatment is also given to forms. A selectButton() method is available on the Crawler which returns another Crawler that matches <button> or <input type="submit"> or <input type="button"> elements (or an <img> element inside them). The string given as argument is looked for in the id, alt, name, and value attributes and the text content of those elements.

This method is especially useful because you can use it to return a Form object that represents the form that the button lives in:

// button example: <button id="my-super-button" type="submit">My super button</button>

// you can get button by its label
$form = $crawler->selectButton('My super button')->form();

// or by button id (#my-super-button) if the button doesn't have a label
$form = $crawler->selectButton('my-super-button')->form();

// or you can filter the whole form, for example a form has a class attribute: <form class="form-vertical" method="POST">
$crawler->filter('.form-vertical')->form();

// or "fill" the form fields with data
$form = $crawler->selectButton('my-super-button')->form([
    'name' => 'Ryan',
]);

The Form object has lots of very useful methods for working with forms:

$uri = $form->getUri();
$method = $form->getMethod();
$name = $form->getName();

The getUri() method does more than just return the action attribute of the form. If the form method is GET, then it mimics the browser’s behavior and returns the action attribute followed by a query string of all of the form’s values.
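As an illustration of the GET behavior, the sketch below builds a Form from a hypothetical search form. The markup, the example.com base URI and the field name q are assumptions.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// illustrative GET form with one pre-filled field
$html = '<html><body><form action="/search" method="get">'
    .'<input type="text" name="q" value="symfony"/>'
    .'<button type="submit">Search</button>'
    .'</form></body></html>';

// hypothetical URI of the page that contains the form
$crawler = new Crawler($html, 'https://example.com/');
$form = $crawler->selectButton('Search')->form();

$method = $form->getMethod();
$uri = $form->getUri();

echo $method, "\n"; // GET
echo $uri, "\n";    // https://example.com/search?q=symfony
```

The button has no name attribute, so it does not contribute a value to the query string.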

Note

The optional formaction and formmethod button attributes are supported. The getUri() and getMethod() methods take into account those attributes to always return the right action and method depending on the button used to get the form.

You can virtually set and get values on the form:

// sets values on the form internally
$form->setValues([
    'registration[username]' => 'symfonyfan',
    'registration[terms]'    => 1,
]);

// gets back an array of values - in the "flat" array like above
$values = $form->getValues();

// returns the values like PHP would see them,
// where "registration" is its own array
$values = $form->getPhpValues();
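The difference between the flat and the PHP-style arrays can be seen in a short sketch. The form markup, the field names and the example.com URI are made up for illustration.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// illustrative form with two fields that share the "registration" prefix
$html = '<html><body><form action="/register" method="post">'
    .'<input type="text" name="registration[username]"/>'
    .'<input type="text" name="registration[email]"/>'
    .'<button type="submit">Register</button>'
    .'</form></body></html>';

$crawler = new Crawler($html, 'https://example.com/');
$form = $crawler->selectButton('Register')->form();

$form->setValues([
    'registration[username]' => 'symfonyfan',
    'registration[email]' => 'fan@example.com',
]);

// flat array, keyed by the raw field names
$flat = $form->getValues();

// nested array, the way PHP would populate $_POST
$nested = $form->getPhpValues();

print_r($flat);
print_r($nested);
```

Here $flat keeps keys such as registration[username], while $nested contains a single registration key holding the username and email entries.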

To work with multi-dimensional fields:

<form>
    <input name="multi[]"/>
    <input name="multi[]"/>
    <input name="multi[dimensional]"/>
    <input name="multi[dimensional][]" value="1"/>
    <input name="multi[dimensional][]" value="2"/>
    <input name="multi[dimensional][]" value="3"/>
</form>

Pass an array of values:

// sets a single field
$form->setValues(['multi' => ['value']]);

// sets multiple fields at once
$form->setValues(['multi' => [
    1             => 'value',
    'dimensional' => 'another value',
]]);

// tick multiple checkboxes at once
$form->setValues(['multi' => [
    'dimensional' => [1, 3] // it uses the input value to determine which checkbox to tick
]]);

This is great, but it gets better! The Form object allows you to interact with your form like a browser, selecting radio values, ticking checkboxes, and uploading files:

$form['registration[username]']->setValue('symfonyfan');

// checks or unchecks a checkbox
$form['registration[terms]']->tick();
$form['registration[terms]']->untick();

// selects an option
$form['registration[birthday][year]']->select(1984);

// selects many options from a "multiple" select
$form['registration[interests]']->select(['symfony', 'cookies']);

// fakes a file upload
$form['registration[photo]']->upload('/path/to/lucas.jpg');

Using the Form Data¶

What’s the point of doing all of this? If you’re testing internally, you can grab the information off of your form as if it had just been submitted by using the PHP values:

$values = $form->getPhpValues();
$files = $form->getPhpFiles();

If you’re using an external HTTP client, you can use the form to grab all of the information you need to create a POST request for the form:

$uri = $form->getUri();
$method = $form->getMethod();
$values = $form->getValues();
$files = $form->getFiles();

// now use some HTTP client and post using this information

One great example of an integrated system that uses all of this is the HttpBrowser provided by the BrowserKit component. It understands the Symfony Crawler object and can use it to submit forms directly:

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// makes a real request to an external site
$browser = new HttpBrowser(HttpClient::create());
$crawler = $browser->request('GET', 'https://github.com/login');

// select the form and fill in some values
$form = $crawler->selectButton('Sign in')->form();
$form['login'] = 'symfonyfan';
$form['password'] = 'anypass';

// submits the given form
$crawler = $browser->submit($form);

Selecting Invalid Choice Values¶

By default, choice fields (select, radio) have internal validation activated to prevent you from setting invalid values. If you want to be able to set invalid values, you can use the disableValidation() method on either the whole form or specific field(s):

// disables validation for a specific field
$form['country']->disableValidation()->select('Invalid value');

// disables validation for the whole form
$form->disableValidation();
$form['country']->select('Invalid value');
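The effect of disableValidation() can be checked with a small sketch. The form markup and the value XX are hypothetical; the first select() call is expected to be rejected, the second one accepted.

```php
<?php
// assumes the component was installed with Composer in the current directory
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// illustrative form with a two-option select
$html = '<html><body><form action="/address" method="post">'
    .'<select name="country">'
    .'<option value="FR">France</option>'
    .'<option value="DE">Germany</option>'
    .'</select>'
    .'<button type="submit">Save</button>'
    .'</form></body></html>';

$crawler = new Crawler($html, 'https://example.com/');
$form = $crawler->selectButton('Save')->form();

// with validation enabled, a value that is not among the options is refused
try {
    $form['country']->select('XX');
    $rejected = false;
} catch (\InvalidArgumentException $e) {
    $rejected = true;
}

// with validation disabled on the field, the same value is accepted
$form['country']->disableValidation()->select('XX');
$value = $form['country']->getValue();

var_dump($rejected); // bool(true)
var_dump($value);    // string(2) "XX"
```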

Resolving a URI¶

New in version 5.1: The UriResolver helper class was added in Symfony 5.1.

The UriResolver class takes a URI (relative, absolute, fragment, etc.) and turns it into an absolute URI against another given base URI:

use Symfony\Component\DomCrawler\UriResolver;

UriResolver::resolve('/foo', 'http://localhost/bar/foo/'); // http://localhost/foo
UriResolver::resolve('?a=b', 'http://localhost/bar#foo'); // http://localhost/bar?a=b
UriResolver::resolve('../../', 'http://localhost/'); // http://localhost/
