Phantom of the Web
Dušan Omerčević / Squrb
March 8th, 2016
Outline of the talk
- Personal introduction
- What problem are we solving
- Introduction to PhantomJS
- PhantomJS tips & tricks
- Aggregating data behind login screens
Dušan Omerčević, M.Sc.
- Founder of Squrb
- Lead engineering and product development at Zemanta
- Head of Software development at Najdi.si
- Researcher in the field of computer vision
- Led several large software development projects (e.g., highway traffic management system, electronic toll collection)
What problem are we solving
Online Services Are On a Roll, Take Back Control
Tracking usage and costs of online services
Sources of usage and costs data
- official APIs (<10% services support it)
- credit card statements (costs only)
- ERPs (costs only)
- emails (costs only)
- aggregate data hidden behind login screens (usage and costs data)
  - Log in to a dashboard
  - Retrieve usage data
  - Retrieve costs data
Given the proliferation of rich, JavaScript-based web applications, it is no longer feasible to simply parse the HTML returned by the server.
Introduction to PhantomJS
PhantomJS
A headless WebKit scriptable with a JavaScript API.
Use cases:
- Headless website testing (Jasmine, QUnit, Mocha, …)
- Web crawling
- Screen capture
- Page automation
- Network monitoring and performance testing (YSlow)
- Server rendering of client-side JavaScript
- Many nefarious purposes (e.g. ad fraud, website hacking, bidding wars)
PhantomJS
- released in 2011 by Ariya Hidayat
- some 100 contributors on GitHub
- version 2.1.1 (QtWebKit 5.5)
  - January 2016 (not yet fully stable)
  - WebKit 538.1 (November 2013) ~ Chrome 27, Safari 8
- version 2.0.0 (QtWebKit 5)
  - January 2015 (quite stable)
  - WebKit 537.11 (2012) ~ Chrome 23, Safari 6.1
- version 1.9.8
  - January 2014
  - WebKit 534.34 (2011) ~ Chrome 13, Safari 5.1
PhantomJS alternatives / companions:
- SlimerJS (scriptable headless Gecko, i.e. Firefox 31)
- TrifleJS (scriptable headless Internet Explorer)
- Zombie.js (scriptable headless browser for Node.js)
- CasperJS (utilities & syntactic sugar over PhantomJS and SlimerJS)
PhantomJS: Hello, World!
$ cat hello.js
console.log('Hello, world!');
phantom.exit();

$ phantomjs hello.js
Hello, world!

$ phantomjs [options] somescript.js [arg1 [arg2 [...]]]
PhantomJS: Loading and Rendering a Page
var page = require('webpage').create();
page.open('http://example.com', function(status) {
    console.log("Status: " + status);
    if (status === "success") {
        page.render('example.png');
    }
    phantom.exit();
});
Page render supports different formats (jpg, png, pdf), clipping regions, scroll position, zoom, and render quality.
PhantomJS: Code Evaluation
Based on http://www.slideshare.net/SergeyShekyan/shekyan-zhang-owasp
(Diagram: the PhantomJS JavaScript context and the web page JavaScript context, both on top of QtWebKit, connected by arrows labeled Control, Page, Event, Injection, and Callback.)
var page = require('webpage').create();
page.open(url, function(status) {
    // The inner function runs inside the web page, not in the PhantomJS script.
    var title = page.evaluate(function() {
        return document.title;
    });
    console.log('Page title is ' + title);
});
page.evaluate
- page.evaluate is executed in web page JavaScript context!
- page.evaluate serializes and deserializes data structures upon return (The rule of thumb: if it can be serialized via JSON, then it is fine.)
- page.evaluateAsync does the same thing but without blocking the current execution
var page = require('webpage').create();
page.open(url, function(status) {
    var personData = page.evaluate(function() {
        var nameEl = document.querySelector('input#name');
        var emailEl = document.querySelector('input#email');
        if (nameEl == null || emailEl == null) {
            return null;
        }
        // Return plain values only; DOM elements do not survive serialization.
        return {name: nameEl.value, email: emailEl.value};
    });
    console.log('Person data: ' + JSON.stringify(personData));
});
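The JSON rule of thumb can be tried out without PhantomJS at all. The sketch below is only an approximation of what page.evaluate does internally (the real serialization may differ in details); jsonRoundTrip is a hypothetical helper name, not part of the PhantomJS API.

```javascript
// Approximates what survives the trip from the web page context back to
// the PhantomJS context: anything JSON cannot represent is lost.
function jsonRoundTrip(value) {
    var text = JSON.stringify(value);
    return text === undefined ? undefined : JSON.parse(text);
}

// Plain data comes back intact.
var plain = jsonRoundTrip({name: 'Ada', tags: ['a', 'b'], count: 2});

// Function-valued properties are silently dropped.
var withFunction = jsonRoundTrip({
    name: 'Ada',
    greet: function() { return 'hi'; }
});

console.log(JSON.stringify(plain));
console.log(Object.keys(withFunction).join(','));
```

If a value looks different after such a round trip, it is safer to extract the needed strings and numbers inside the page and return only those.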
PhantomJS: Loading and injecting scripts
Injecting scripts in PhantomJS JavaScript context
var wasSuccessful = phantom.injectJs('lib/utils.js');

Injecting scripts in web page JavaScript context
var page = require('webpage').create();
page.open('http://www.sample.com', function() {
    page.includeJs("https://cdnjs.cloudflare.com/libs/jquery.js", function() {
        // jQuery is now available inside the page.
        page.evaluate(function() {
            $("button").click();
        });
    });
});

page.injectJs works like page.includeJs, except that it loads a local file and pauses execution until the script is fully loaded.
PhantomJS: Support for CommonJS Modules
Example module (universe.js)
exports.answer = 42;
exports.start = function () {
    console.log('Starting the universe....');
}

This module can be used in another script like the following:
var universe = require('./universe');
universe.start();
console.log('The answer is', universe.answer);
PhantomJS: Cookie handling
Global cookie jar
phantom.addCookie({
    'name': 'Added-Cookie-Name',
    'value': 'Added-Cookie-Value',
    'domain': '.google.com'
});

Page specific cookie jar
var page = require('webpage').create();
page.addCookie( ...);
PhantomJS: Handling frames
var page = require('webpage').create();
// Once the page has loaded, frame names are available:
var frameName = page.framesName[0];
page.switchToFrame(frameName);
page.evaluate(function() {
    document.querySelector('a#target').click();
});
PhantomJS: Remote control via web server
var $q = require("q"); // Kris Kowal’s Q
var server = require('webserver').create();

var simpleProxyService = server.listen(HTTP_PORT, function(request, response) {
    $q.Promise(function(resolve, reject, notify) {
        var page = require('webpage').create();
        page.onLoadFinished = function(status) {
            resolve(page.content);
        };
        page.open('https://www.example.com/');
    }).then(function (result) {
        response.statusCode = 200;
        response.write(result);
        response.close();
    });
});
PhantomJS and Node.js don’t like each other.
PhantomJS tips & tricks
onResourceRequested: requestData
Know what requests are being made by the page:
- very useful for debugging
var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function(requestData, networkRequest) {
    console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
};
onResourceRequested: abort()
Abort the current network request
- Speed up page rendering (e.g. by not loading tracking JS libraries and large images)
- Prevent PhantomJS crashes triggered by external libraries
var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function(requestData, networkRequest) {
    networkRequest.abort();
};
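In practice only some requests should be aborted, otherwise the page itself never loads. A minimal sketch of selective aborting follows; the blocklist patterns and the shouldAbort helper are hypothetical and would need tuning per site. The PhantomJS wiring is guarded so the helper can also be tried outside PhantomJS.

```javascript
// Hypothetical blocklist: a tracking library and common image formats.
var BLOCKED_PATTERNS = [
    /google-analytics\.com/i,
    /\.(png|jpe?g|gif)(\?|$)/i
];

// Returns true when the URL matches any blocked pattern.
function shouldAbort(url) {
    for (var i = 0; i < BLOCKED_PATTERNS.length; i++) {
        if (BLOCKED_PATTERNS[i].test(url)) {
            return true;
        }
    }
    return false;
}

// Wire the helper up only when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function(requestData, networkRequest) {
        if (shouldAbort(requestData.url)) {
            networkRequest.abort();
        }
    };
}
```

Keeping the decision in a plain function makes the blocklist easy to check against URLs collected with the logging handler.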
onResourceRequested: changeUrl(newUrl)
Provide an alternative implementation of a resource:
- Mocking-up libraries & altering page functionality
  - e.g. networkRequest.changeUrl(requestData.url.replace('perPage=20', 'perPage=1000'));
- Speed up page rendering (e.g. by replacing remote resources with local copies)
var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function(requestData, networkRequest) {
    networkRequest.changeUrl('dummy.js');
};
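As with aborting, rewriting every request is rarely what is wanted. The sketch below rewrites only URLs that match a rule, using the perPage idea from the bullet list; the REWRITES table and rewriteUrl helper are hypothetical names, and the example URL is made up.

```javascript
// Hypothetical rewrite rules as [pattern, replacement] pairs.
var REWRITES = [
    [/perPage=\d+/, 'perPage=1000']
];

// Applies every rule in turn and returns the (possibly unchanged) URL.
function rewriteUrl(url) {
    var result = url;
    for (var i = 0; i < REWRITES.length; i++) {
        result = result.replace(REWRITES[i][0], REWRITES[i][1]);
    }
    return result;
}

// Wire the helper up only when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function(requestData, networkRequest) {
        var newUrl = rewriteUrl(requestData.url);
        if (newUrl !== requestData.url) {
            networkRequest.changeUrl(newUrl);
        }
    };
}

console.log(rewriteUrl('https://app.example.com/api/items?perPage=20&page=1'));
```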
onResourceRequested: setHeader(key, value)
Changing request headers before the request is made:
- Mocking-up requests
var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function(requestData, networkRequest) {
    networkRequest.setHeader('Authorization', 'Bearer 08xvgs7sbd6d');
};
onResourceReceived
var fs = require('fs');
var page = require('webpage').create();
// Do not forget to set this!
page.captureContent = ['app.example.com/account/billing'];

page.onResourceReceived = function(response) {
    // Write the file only once a non-empty body has been captured.
    if (response.url.indexOf('app.example.com/account/billing') >= 0 && response.body && response.body.length > 0) {
        fs.write('invoice.pdf', response.body, 'b');
    }
};

page.open('https://app.example.com/account/billing/invoice_1234.pdf');
Retrieving body content upon onResourceReceived does not work in PhantomJS 2.1.x (it’s a known bug!)
Making async XMLHttpRequests
page.evaluate(function() {
    var http = new XMLHttpRequest();
    http.open('POST', 'https://www.example.com/search?search_type=users', true);
    http.setRequestHeader('Content-type', 'application/json');
    http.onreadystatechange = function() {
        if (http.readyState == 4 && http.status == 200) {
            window.callPhantom(http.responseText);
        }
    };
    http.send('{"search":{"page":1,"per_page":1000}}');
});

page.onCallback = function(responseText) {
    var result = JSON.parse(responseText);
};
Mouse clicking & key pressing
The events are not synthetic DOM events; each event is sent to the web page as if it came from real user interaction.
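var SHIFT_KEY = 0x02000000;
var ALT_KEY = 0x08000000;

var page = require('webpage').create();
page.open('http://phantomjs.org/quick-start.html', function(status) {
    // Return plain coordinates; a DOM element itself cannot be serialized.
    var rect = page.evaluate(function() {
        var r = document.querySelector('img[alt="PhantomJS"]').getBoundingClientRect();
        return {left: r.left, top: r.top, width: r.width, height: r.height};
    });

    // Click the center of the element, then press Shift+Alt+A.
    page.sendEvent('click', rect.left + rect.width / 2, rect.top + rect.height / 2, 'left');
    page.sendEvent('keypress', page.event.key.A, null, null, SHIFT_KEY | ALT_KEY);
});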
Detecting PhantomJS
Exploiting differences between PhantomJS and a real browser:
- outdated WebKit engine
- uses QtWebKit wrapper around WebKit
- no video and audio
- no plug-ins
- exposes window.callPhantom and window._phantom
- no sandboxing (turn a headless browser against the attacker :) )
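The simplest of these checks, looking for the two exposed globals, can be sketched as follows. looksLikePhantom is a hypothetical helper; it takes the window object as a parameter so it can be exercised with stand-in objects, and it is trivially defeated by a script that deletes those properties.

```javascript
// Returns true when the given window-like object exposes either of the
// PhantomJS-specific globals mentioned above.
function looksLikePhantom(win) {
    return 'callPhantom' in win || '_phantom' in win;
}

// Stand-in objects for illustration; in a page this would be `window`.
var ordinaryWindow = {document: {}};
var phantomWindow = {document: {}, callPhantom: function() {}, _phantom: {}};

console.log(looksLikePhantom(ordinaryWindow));
console.log(looksLikePhantom(phantomWindow));
```

A site would typically combine several signals from the list rather than rely on this one.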
Detailed information available in http://www.slideshare.net/SergeyShekyan/shekyan-zhang-owasp
Aggregating data behind login screens
Logging in
- open login screen, enter credentials, and click log in button (the most common scenario)
- POSTing credentials
- logging in using identity providers (e.g. Google, GitHub)
  - 1st, log in to identity provider,
  - 2nd, click on “Login with Google” or “Login with GitHub” button (voilà!)
- 2-factor authentication
  - keep the PhantomJS session running, while asking user to enter 2nd factor
  - 2FA is not a panacea for session hijacking!
- CAPTCHAs
  - screengrab CAPTCHA and ask user to solve it, while keeping session running
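The most common scenario (fill in the form and click the button) might look roughly like the sketch below. The selectors, the login URL, and the fillLoginForm helper are all hypothetical; real login pages differ, and JavaScript-heavy forms may additionally need input events before they accept the values. The optional doc parameter exists only so the function can be tried against a stand-in document.

```javascript
// Runs inside the web page: fills in the credentials and clicks submit.
// Returns false when the expected form elements are not present.
function fillLoginForm(username, password, doc) {
    doc = doc || document; // inside page.evaluate this is the page's document
    var userEl = doc.querySelector('input[name="username"]');
    var passEl = doc.querySelector('input[name="password"]');
    var button = doc.querySelector('button[type="submit"]');
    if (!userEl || !passEl || !button) {
        return false;
    }
    userEl.value = username;
    passEl.value = password;
    button.click();
    return true;
}

// Wire the helper up only when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('https://app.example.com/login', function(status) {
        // Extra arguments to page.evaluate are passed to the function.
        var submitted = page.evaluate(fillLoginForm, 'user@example.com', 'secret');
        console.log('Form submitted: ' + submitted);
        // Next step (not shown): wait for the dashboard to load,
        // scrape it, and only then call phantom.exit().
    });
}
```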
POSTing credentials example
var settings = {
    operation: "POST",
    encoding: "utf8",
    headers: {
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
    },
    // Encode each value separately so that '&' and '=' inside them are escaped.
    data: 'username=' + encodeURIComponent(username) + '&password=' + encodeURIComponent(password)
};

var page = require('webpage').create();
page.open('https://app.example.com/v2/users/login/', settings, processLogin);
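When a login form has more than two fields, building the body by hand gets error-prone. A small sketch of a generic encoder follows; formEncode is a hypothetical helper, and it encodes spaces as %20 rather than the + that browsers send, which servers generally accept as well.

```javascript
// Builds an application/x-www-form-urlencoded body from a plain object,
// percent-encoding every key and value separately.
function formEncode(fields) {
    var pairs = [];
    for (var key in fields) {
        if (Object.prototype.hasOwnProperty.call(fields, key)) {
            pairs.push(encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]));
        }
    }
    return pairs.join('&');
}

// A password containing '&' and '=' stays a single field.
console.log(formEncode({username: 'ada@example.com', password: 'p&q=1'}));
// prints "username=ada%40example.com&password=p%26q%3D1"
```

The result can be used directly as the data property of the settings object passed to page.open.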
Parsing dashboard data
var invoices = page.evaluate(function() {
    var result = [];
    var invoiceElements = document.querySelectorAll('table#invoice-list > tbody > tr');
    for (var i = 0; i < invoiceElements.length; i++) {
        var invoiceFields = invoiceElements[i].querySelectorAll('td');
        var invoiceURL = invoiceFields[0].querySelector('a').href;
        result.push({
            invoiceDate: new Date(Date.parse(invoiceFields[1].textContent.trim())),
            invoiceID: invoiceURL.match(/invoice_as_pdf\/(.*)/i)[1],
            amount: invoiceFields[2].textContent.trim(),
            description: invoiceFields[0].textContent.trim(),
            invoiceURL: invoiceURL
        });
    }
    return result;
});
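Dashboards change without notice, so a scraper like this tends to break on the first row that does not match expectations. One defensive step, sketched here with a hypothetical extractInvoiceId helper and made-up URLs, is to return null instead of throwing when the invoice link does not have the expected shape.

```javascript
// Extracts the invoice ID from a link such as ".../invoice_as_pdf/1234".
// Returns null when the URL does not contain the expected path segment.
function extractInvoiceId(url) {
    var match = /invoice_as_pdf\/(.*)/i.exec(url);
    return match ? match[1] : null;
}

console.log(extractInvoiceId('https://app.example.com/invoice_as_pdf/1234'));
console.log(extractInvoiceId('https://app.example.com/settings'));
```

Because page.evaluate only sees what is defined inside the evaluated function, a helper like this has to be declared within that function (or injected into the page) before it is used there.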
Make use of unofficial APIs
(used extensively by modern javascript-based web applications)
var page = require('webpage').create();
page.captureContent = ['projects.exampleapp.com/api/account'];

page.onResourceReceived = function(response) {
    // Parse the body only once a non-empty body has been captured.
    if (response.url.indexOf('projects.exampleapp.com/api/account') >= 0 && response.body && response.body.length > 0) {
        var accountInfo = JSON.parse(response.body);
    }
};

page.open('https://projects.exampleapp.com/d/main#/team/account');
Thank you!
(It’s Q&A time now!)
Fun Times Ahead
Squrb has an early start in a potentially enormous market.
We’re looking for a product-minded engineer with solid JavaScript knowledge to join the core team.
Contact me at [email protected] for more information.