my weekend startup: seocrawler.co

Posted on 27-Jan-2015


DESCRIPTION

Why and how Seocrawler.co was built: a talk for the WebCamp Zagreb 2013 conference. It presents the technical side of the project, with development advice for building a crawler/spider.

TRANSCRIPT

My weekend startup project

SEOCRAWLER.CO

TEAM

Goran Čandrlić: Conversion, Google AdWords & Internet Marketing Specialist, Webiny co-founder

Hrvoje Hudoletnjak: Software developer, Microsoft ASP.NET/IIS MVP

WHY?

Target market: webmasters, site owners, marketers

Usage scenarios:
- Get broken pages, redirects, non-index, non-follow, ...
- On-site SEO quality
- Crawl competitor pages and find out what they are doing

Business model:
- Free
- Pay as you go
- Share and get credits

THE PLAN

Let's build a crawler
- MVP version: download a CSV file of all pages
- Public launch: browse crawled pages online, payments

Let's spread the word
- Use social channels to attract more users

Let's see what we're missing and what can be done better
- Find out what people would be willing to pay for
- Iterate, find new niche markets, ask and listen to people

GETTING HANDS DIRTY

ENGINE DEV
- Basic engine: 2 days
- Production ready (horizontal scalability, disaster recovery, ...): 60+ days
- Find edge cases (broken HTML), keep the crawler running for days/weeks without crashing
- Analysis (tags and content)
- Store reports for user filtering and browsing

WEB APP
- Landing page + admin UI (ThemeForest)
- Communication with crawlers
- Browse reports, filters
- Payment gateway integration (PayPal)
- Ticketing support system

CURRENT STATUS

2.5M pages crawled, 150 GB transferred, 800 registered users

Most important things:
- We (think we) know what we should do next
- Polished some edge cases, made the service more stable
- Got the word spread
- Got a speaking slot at WebCampZg!!

[Architecture diagram: USER → FRONT END WEB APP (HTML, CSS, Ajax/WebSockets) → RABBITMQ → CRAWLERS, with DB, CLOUD STORAGE and CSV RESULT]

FRONT END/ ADMIN UI

- Landing page + admin theme from ThemeForest
- ASP.NET MVC 4
- Entity Framework 5 (POCO, EF migrations)
- DotNetOpenAuth for social login
- EasyNetQ for RabbitMQ (pub/sub), CQS pattern for in-process messaging
- SignalR (full duplex: WebSockets, with Ajax polling fallback)
- KnockoutJS, jQuery, Toastr
- StructureMap IoC/DI, AutoMapper (DB entities <> DTOs)

CRAWLER

[Architecture diagram: CONTROLLER and CRAWLER WORKERS (...) connected over a COMMAND/QUERY BUS (CQS) and RABBITMQ, with ADO.NET / EF persistence and LOG]

CRAWLER SERVICE

- Multi-threaded crawler (vs. evented crawler)
- Entity Framework 5: LINQ + raw SQL queries with EF + ADO.NET bulk insert
- EasyNetQ, RabbitMQ, CQS pattern
- StructureMap, HtmlAgilityPack, NLog, Protobuf
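HtmlAgilityPack is the HTML parser in the stack above; a minimal sketch of link extraction with it might look like this (the filtering and normalization here are simplified assumptions, not the production code):

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Extract absolute http(s) links from raw HTML, resolved against the page URL.
    public static IEnumerable<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of broken HTML, which real pages are full of

        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) yield break;

        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            // Resolve relative URLs; skip mailto:, javascript: and unparsable hrefs.
            if (Uri.TryCreate(baseUri, href, out var abs) &&
                (abs.Scheme == Uri.UriSchemeHttp || abs.Scheme == Uri.UriSchemeHttps))
            {
                yield return abs;
            }
        }
    }
}
```

A tolerant parser matters here because, as noted later in the talk, broken HTML is one of the main edge cases a crawler hits.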

CRAWLER WORKER PROCESS

Start or resume
- Resume: load state (SQL, serialized)

Main loop:
- Get next page from queue (RabbitMQ, durable store)
- Download HTML (200 ms – 5 s delay), HEAD request for external links
- Check statuses, canonical, redirects
- Run page analysers, extract data for report, prepare for bulk insert
- Find links
- Check duplicates and blacklist
- Check robots.txt
- Check if visited (cache & DB)
- Normalize & store to queue (RabbitMQ)

Save state every N pages (serialize with Protobuf, store byte[] to DB)
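The worker loop above can be sketched roughly as follows. This is a hypothetical outline, not the real service's code: the interface names (IPageQueue, IVisitedStore) and injected delegates are placeholders standing in for RabbitMQ, the cache/DB visited check, and HtmlAgilityPack parsing.

```csharp
using System;
using System.Collections.Generic;

// Placeholder abstractions, not the actual seocrawler.co API.
public interface IPageQueue { bool TryDequeue(out Uri url); void Enqueue(Uri url); }
public interface IVisitedStore { bool Seen(Uri url); void Mark(Uri url); }

public class CrawlerWorker
{
    private readonly IPageQueue _queue;
    private readonly IVisitedStore _visited;
    private readonly Func<Uri, string> _download;                  // HTTP GET; injectable for tests
    private readonly Func<string, Uri, IEnumerable<Uri>> _extract; // HtmlAgilityPack parsing elided
    private readonly Action _saveState;                            // Protobuf-serialize, store byte[] in DB
    public int PagesProcessed { get; private set; }

    public CrawlerWorker(IPageQueue queue, IVisitedStore visited,
        Func<Uri, string> download, Func<string, Uri, IEnumerable<Uri>> extract, Action saveState)
    { _queue = queue; _visited = visited; _download = download; _extract = extract; _saveState = saveState; }

    public void Run(int saveEveryNPages = 1000)
    {
        while (_queue.TryDequeue(out var url))       // durable RabbitMQ queue in production
        {
            string html;
            try { html = _download(url); }           // politeness delay + HEAD for externals omitted
            catch (Exception) { continue; }          // everything will crash: log and keep running

            foreach (var link in _extract(html, url))
            {
                if (_visited.Seen(link)) continue;   // duplicate check against cache & DB
                _visited.Mark(link);
                _queue.Enqueue(link);                // normalized link back onto the queue
            }

            if (++PagesProcessed % saveEveryNPages == 0)
                _saveState();                        // resume point: save state every N pages
        }
    }
}
```

Injecting the download and extract steps keeps the loop testable without network access, which matches the talk's advice to keep the crawler surviving failures for days or weeks.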

RABBITMQ + EASYNETQ

rabbitBus.Subscribe<RecreateReportMessage>("crawlerservice", message =>
{
    _commandBus.Execute(new MakeReportCommand(message.ProjectId));
});

rabbitBus.OpenChannel(c => c.Publish(new RecreateReportMessage(id)));

[Diagram: ADMIN UI · SERVICE]

COMMAND BUS (MEDIATOR)

- Encapsulate commands / queries into classes
- IoC / DI for finding and matching a handler with the command/query type
- Easy unit testing
- AOP: intercept a query or command, pre/post execution (logging, auth, caching, ...)

bool alreadyVisited = _bus.Request<bool>(new VisitedPageQuery.Input(projectId, urlHash));

_bus.Execute(new SavePageCommand(pageData, webPage));

public class SavePageReportHandler : IHandle<SavePageCommand>
{
    // implementation
}
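A minimal sketch of such a mediator-style command bus, assuming the IHandle/Execute shapes from the snippets above. The dictionary-based handler lookup is a simplification; the real implementation resolves handlers through an IoC container (StructureMap) and can wrap execution for logging, auth, or caching.

```csharp
using System;
using System.Collections.Generic;

public interface IHandle<TCommand> { void Handle(TCommand command); }

// Simplified in-memory command bus (mediator): one handler per command type.
public class CommandBus
{
    private readonly Dictionary<Type, object> _handlers = new Dictionary<Type, object>();

    public void Register<T>(IHandle<T> handler) => _handlers[typeof(T)] = handler;

    public void Execute<T>(T command)
    {
        if (!_handlers.TryGetValue(typeof(T), out var h))
            throw new InvalidOperationException("No handler for " + typeof(T).Name);
        ((IHandle<T>)h).Handle(command); // AOP interception would wrap this call
    }
}
```

Because callers only depend on the bus, a unit test can register a fake handler and assert it was invoked, which is the "easy unit testing" point from the slide.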

ISSUES

- Everything will crash: net connection, DB, thread, VM, ...
- Resuming / saving states
- Memory issues/leaks with some frameworks
- Don't optimize before profiling (memory, DB)
- Log everything
- DB indexes: how to store for fast filtering and paging
- DB as a queueing system (don't)
- CQS: command / query separation
- Broken HTML, crazy links
- Cloud services: connections fail

LEARNED

ORM
- Go low level (raw SQL, bulk insert, SP) if needed
- Profile: memory, SQL queries
- Watch for the 1st-level cache (ORM unit of work or session)
- NoSQL?
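The "go low level" advice above can be illustrated with ADO.NET's SqlBulkCopy, which loads a batch in one round trip instead of one EF INSERT per entity. The table and column names here are hypothetical, not the project's actual schema:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class PageBulkWriter
{
    // Build an in-memory table matching the (hypothetical) destination schema.
    public static DataTable BuildTable(IEnumerable<(string Url, int Status)> rows)
    {
        var table = new DataTable();
        table.Columns.Add("Url", typeof(string));
        table.Columns.Add("StatusCode", typeof(int));
        foreach (var r in rows) table.Rows.Add(r.Url, r.Status);
        return table;
    }

    // Bulk-insert crawled page rows; far faster than per-row ORM inserts.
    public static void WritePages(string connectionString, DataTable pages)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "dbo.CrawledPages"; // hypothetical table
                bulk.BatchSize = 5000;
                bulk.WriteToServer(pages);
            }
        }
    }
}
```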

Caching
- In process, in memory
- Plan moving to a separate service (Redis, ...)

SOA
- Pipeline design
- Pub/Sub, CQS pattern (Mediator)
- Unit testing
- Cloud resilience

HOSTING

Hosting:
- All on one server for now
- Started with EC2
- Migrated to Azure VM (higher HDD IO, faster CPU), BizSpark (free VM), free inbound traffic!
- Now on Hetzner (dedicated, i7, 32 GB RAM, 2x SSD, Win 2012 = 60 €/m)

Stack: Win 2012, SQL Server 2012, .NET 4.5, ASP.NET MVC 4
Load & stress testing (crawl 500k URLs)

Goal: 100 parallel crawlers on a 2-CPU / 4 GB RAM VM (running OS and DB too)

Will scale when needed

FUTURE PLANS

- Fancy reports
- Brand new web user interface
- Integration with 3rd-party services (MajesticSEO, ...)
- Special page analysis
- NoSQL (RavenDB or Redis) for caching
- Warehouse DB for browsing crawled pages
- Lucene for full-text search (RavenDB)
- Refactor crawler: pipeline design, async evented design

THANK YOU! QUESTIONS?

Hrvoje Hudoletnjak
m: hrvoje@hudoletnjak.com
t: twitter.com/hhrvoje

Goran Čandrlić
m: gorancandrlic@gmail.com
t: twitter.com/chande
