Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

技能树升级——Chrome Headless模式 #10

Open
yesvods opened this issue Apr 14, 2017 · 6 comments
Open

技能树升级——Chrome Headless模式 #10

yesvods opened this issue Apr 14, 2017 · 6 comments

Comments

@yesvods
Copy link
Owner

yesvods commented Apr 14, 2017

也许最近已经听说Chrome59将支持headless模式,PhantomJS核心开发者Vitaly表示自己将会失业了。

Headless模式解决了什么问题

3年前,无头浏览器PhantomJS已经如火如荼出现了,紧跟着NightmareJS也成为一名巨星。无头浏览器带来巨大便利性:页面爬虫、自动化测试、WebAutomation...

为啥Chrome又插了一脚

用过PhantomJS的都知道,它的环境是运行在一个封闭的沙盒里面,在环境内外完全不可通信,包括API、变量、全局方法调用等。一个之前写的微信页面爬虫,实现内外通信的方式极其Hack,为了达到目的,不择手段,令人发指,看过的哥们都会蛋疼。

So, 很自然的,Chrome59版支持的特性,全部可以利用,简直不要太爽:

  • ES2017
  • ServiceWork(PWA测试随便耍)
  • 无沙盒环境
  • 无痛通讯&API调用
  • 无与伦比的速度
  • ...

技能树启动点

为了点亮技能树,我们需要以下配置:

大致来说,有那么个过程:

启动Headless模式

有各种脚本启动方式,本次我们使用termial参数方式来打开:

$ /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --headless --remote-debugging-port=9222

在Dock中,一个黄色的东西就会被启动,但是他不会跳出来。

操控无头浏览器

依旧有各种方式,我们先安装一个工具帮助我们来对黄色浏览器做点事情:

$ tnpm i -S chrome-remote-interface 

燥起来

捕获所有请求

Pretty Simple,写一个index.js:

const CDP = require("chrome-remote-interface");
 
CDP(client => {
  // extract domains
  const { Network, Page } = client;
  // setup handlers
  Network.requestWillBeSent(params => {
    console.log(params.request.url);
  });
  Page.loadEventFired(() => {
    client.close();
  });
  // enable events then start!
  Promise.all([Network.enable(), Page.enable()])
    .then(() => {
      return Page.navigate({ url: "https://github.com" });
    })
    .catch(err => {
      console.error(err);
      client.close();
    });
}).on("error", err => {
  // cannot connect to the remote endpoint
  console.error(err);
});

AND run it:

$ node index.js

结果会展示一堆url:

https://github.com/
https://assets-cdn.github.com/assets/frameworks-12d63ce1986bd7fdb5a3f4d944c920cfb75982c70bc7f75672f75dc7b0a5d7c3.css
https://assets-cdn.github.com/assets/github-2826bd4c6eb7572d3a3e9774d7efe010d8de09ea7e2a559fa4019baeacf43f83.css
https://assets-cdn.github.com/assets/site-f4fa6ace91e5f0fabb47e8405e5ecf6a9815949cd3958338f6578e626cd443d7.css
https://assets-cdn.github.com/images/modules/site/home-illo-conversation.svg
https://assets-cdn.github.com/images/modules/site/home-illo-chaos.svg
https://assets-cdn.github.com/images/modules/site/home-illo-business.svg
https://assets-cdn.github.com/images/modules/site/integrators/slackhq.png
https://assets-cdn.github.com/images/modules/site/integrators/zenhubio.png
https://assets-cdn.github.com/assets/compat-8a4318ffea09a0cdb8214b76cf2926b9f6a0ced318a317bed419db19214c690d.js
https://assets-cdn.github.com/static/fonts/roboto/roboto-medium.woff
...

捕获DOM内所有图片

这次轮到演示一下如何操控DOM:

const CDP = require("chrome-remote-interface");
 
CDP(chrome => {
  chrome.Page
    .enable()
    .then(() => {
      return chrome.Page.navigate({ url: "https://github.com" });
    })
    .then(() => {
      chrome.DOM.getDocument((error, params) => {
        if (error) {
          console.error(params);
          return;
        }
        const options = {
          nodeId: params.root.nodeId,
          selector: "img"
        };
        chrome.DOM.querySelectorAll(options, (error, params) => {
          if (error) {
            console.error(params);
            return;
          }
          params.nodeIds.forEach(nodeId => {
            const options = {
              nodeId: nodeId
            };
            chrome.DOM.getAttributes(options, (error, params) => {
              if (error) {
                console.error(params);
                return;
              }
              console.log(params.attributes);
            });
          });
        });
      });
    });
}).on("error", err => {
  console.error(err);
});

最后会返回数组,看起来像酱紫:

[
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/home-illo-conversation.svg',
    'alt',
    '',
    'width',
    '360',
    'class',
    'd-block width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/home-illo-chaos.svg',
    'alt',
    '',
    'class',
    'd-block width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/home-illo-business.svg',
    'alt',
    '',
    'class',
    'd-block width-fit mx-auto mb-4' ]
    ...
]

chrome-remote-interface 提供一套完整的API用于利用全量Chrome特性,更多使用方法参考:https://github.com/cyrus-and/chrome-remote-interface

总结

Chrome Headless特性,不仅仅革新了原有格局,而且提高开发效率,降低使用门槛,对于经常使用爬虫、自动化测试前端童鞋来说简直是巨大福音,对于新童鞋来说也是一个新潮的玩具。

@xiatian
Copy link

xiatian commented May 6, 2017

没找到Linux版本呢?

1 similar comment
@wxnet2013
Copy link

没找到Linux版本呢?

@JustinYi922
Copy link

有Ubuntu版本,参考这里https://medium.com/@dschnr/using-headless-chrome-as-an-automated-screenshot-tool-4b07dffba79a。
我现在找找怎么在centos上实现。

@yesvods
Copy link
Owner Author

yesvods commented May 16, 2017

# Install Google Chrome
# https://askubuntu.com/questions/79280/how-to-install-chrome-browser-properly-via-command-line
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb  # Might show "errors", fixed by next line
sudo apt-get install -f

@xiatian @wxnet2013

@AllenCHM
Copy link

可以直接通过aptitude install google-chrome 安装。

@wx7614140
Copy link

最近看selenium webdriver ,发现chromdriver 可以设置headless.
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setHeadless(true);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants