Question

Puppeteer: Grabbing entire html from page that uses lazy load

I am trying to grab the entire HTML of a web page that uses lazy loading. What I have tried is scrolling all the way to the bottom and then using page.content(). I have also tried scrolling back to the top of the page after scrolling to the bottom, and then using page.content(). Both approaches grab some rows of the table, but not all of them, which is my main goal. I believe the web page uses lazy loading from React.

const puppeteer = require('puppeteer');
const url = 'https://www.torontopearson.com/en/departures';
const fs = require('fs');

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitFor(300);

    //scroll to bottom
    await autoScroll(page);
    await page.waitFor(2500);

    //scroll to top of page
    await page.evaluate(() => window.scrollTo(0, 50));

    let html = await page.content();

    await fs.writeFile('scrape.html', html, function(err){
        if (err) throw err;
        console.log("Successfully Written to File.");
    });
    await browser.close();
});

//method used to scroll to bottom, referenced from user visualxcode on https://github.com/GoogleChrome/puppeteer/issues/305
async function autoScroll(page){ 
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            var totalHeight = 0;
            var distance = 300;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                if(totalHeight >= scrollHeight){
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}

    Answer - 1

    The problem is that the linked page uses the library react-virtualized. This library only renders the visible part of the website, so you cannot get the whole table from the DOM at once. Scrolling to the bottom of the table will only put the bottom part of the table into the DOM.
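
    If you still want to scrape the rows from the DOM, one workaround is to collect whatever rows are rendered at each scroll position and merge them. A minimal sketch, assuming the rows are plain <tr> elements and the table scrolls with the window (both of which may differ on the actual page):

    async function collectVirtualizedRows(page, rowSelector = 'tr') {
        const seen = new Set();

        while (true) {
            // Grab the rows currently rendered in the DOM and remember their text.
            const rows = await page.$$eval(rowSelector, els => els.map(el => el.innerText));
            rows.forEach(row => seen.add(row));

            // Scroll one viewport further; if the position no longer changes, we are at the bottom.
            const atBottom = await page.evaluate(() => {
                const before = window.scrollY;
                window.scrollBy(0, window.innerHeight);
                return window.scrollY === before;
            });
            if (atBottom) break;

            // Give react-virtualized a moment to render the next slice of rows.
            await new Promise(resolve => setTimeout(resolve, 300));
        }

        return [...seen];
    }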

    To find out where the page loads its content from, you should check the network tab of the DevTools. You will notice that the content of the page is loaded from this URL, which seems to provide a perfect representation of the DOM in JSON format. So there is really no need to scrape that data from the page; you can just use the URL.
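
    A minimal sketch of finding and capturing that JSON with Puppeteer itself, by listening to network responses (the content-type filter below is an assumption; check the network tab for the exact endpoint and tighten the match accordingly):

    const puppeteer = require('puppeteer');

    puppeteer.launch().then(async browser => {
        const page = await browser.newPage();

        // Log every JSON response the page triggers; the departures data
        // should show up as one of these URLs.
        page.on('response', async response => {
            const type = response.headers()['content-type'] || '';
            if (type.includes('application/json')) {
                console.log('JSON response:', response.url());
                // const data = await response.json(); // parse once the right endpoint is identified
            }
        });

        await page.goto('https://www.torontopearson.com/en/departures', { waitUntil: 'networkidle0' });
        await browser.close();
    });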


      Answer - 2

      I am not very good at this, but after searching for a long time I found one solution that gave good results for one of my requirements. Here is the piece of code I used to handle lazy-load scenarios:

      // Small delay helper used below (not defined in the original snippet).
      const wait = ms => new Promise(resolve => setTimeout(resolve, ms));

      // Measure the full height of the page body.
      const bodyHandle = await page.$('body');
      const { height } = await bodyHandle.boundingBox();
      await bodyHandle.dispose();

      console.log('Handling viewport...');
      const viewportHeight = page.viewport().height;
      let viewportIncr = 0;

      // Scroll one viewport at a time until the bottom of the body is reached,
      // giving the lazily loaded content time to render after each step.
      while (viewportIncr + viewportHeight < height) {
          await page.evaluate(_viewportHeight => {
              window.scrollBy(0, _viewportHeight);
          }, viewportHeight);
          await wait(30);
          viewportIncr = viewportIncr + viewportHeight;
      }

      console.log('Handling scroll operations');

      // Scroll back to the top before taking the full-page screenshot.
      await page.evaluate(_ => {
          window.scrollTo(0, 0);
      });
      await wait(100);
      await page.screenshot({ path: 'GoogleHome.jpg', fullPage: true });


      With this I am able to take long screenshots as well. Hope this will help you.
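
      Since the original goal was the full HTML rather than a screenshot, the same scroll loop can be followed by page.content(); a small sketch (keep in mind that with react-virtualized only the currently rendered rows will be in the DOM, as noted in the first answer):

      const fs = require('fs').promises;

      // Run this after the scroll loop above has finished.
      const html = await page.content();
      await fs.writeFile('scrape.html', html);
      console.log('Successfully written to file.');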
