My weekend startup: seocrawler.co
My weekend startup project
SEOCRAWLER.CO
TEAM
Goran Čandrlić: Conversion, Google AdWords & Internet Marketing Specialist, Webiny Cofounder
Hrvoje Hudoletnjak: Software developer, Microsoft ASP.NET/IIS MVP
WHY?
Target market: web masters, site owners, marketers
Usage scenarios:
- Find broken pages, redirects, noindex, nofollow, ...
- Check on-site SEO quality
- Crawl competitor pages and find out what they are doing
Business model: free, pay as you go, share and get credits
THE PLAN
Let’s build a crawler
- MVP version: download a CSV file of all pages
- Public launch: browsing crawled pages online, payments
Let’s spread the word
- Use social channels to attract more users
Let’s see what we’re missing and what can be done better
- Find out what people would like to pay for
- Iterate, find new niche markets, ask and listen to people
GETTING HANDS DIRTY
ENGINE DEV
- Basic engine: 2 days
- Production ready (horizontal scalability, disaster recovery, ...): 60+ days
- Find edge cases (broken HTML), keep the crawler running for days/weeks without crashing
- Analysis (tags and content)
- Store reports for user filtering and browsing
WEB APP
- Landing page + admin UI (Themeforest)
- Communication with crawlers
- Browse reports, filters
- Payment gateway integration (PayPal)
- Ticketing support system
CURRENT STATUS
2.5M pages crawled
150 GB transferred
800 registered users
Most important things:
- we (think we) know what we should do next
- polished some edge cases, made the service more stable
- got the word spread
- got a speaking slot at WebCampZg!
[Architecture diagram: USER, FRONT END WEB APP (HTML, CSS, AJAX / WEBSOCKETS), RABBIT MQ, CRAWLERS, DB, CLOUD STORAGE, CSV RESULT]
FRONT END/ ADMIN UI
- Landing page + admin theme from Themeforest
- ASP.NET MVC 4
- Entity Framework 5 (POCO, EF migrations)
- DotNetOpenAuth for social login
- EasyNetQ for RabbitMQ (pub/sub), CQS pattern for in-process messages
- SignalR (full duplex: WebSockets, falling back to Ajax polling)
- KnockoutJS, jQuery, Toastr
- StructureMap IoC/DI, AutoMapper (db entities <> DTO)
[Diagram: CRAWLER CONTROLLER with multiple CRAWLER WORKERs, COMMAND/QUERY BUS (CQS), RABBIT MQ, ADO.NET / EF, LOG]
CRAWLER SERVICE
- Multi-threaded crawler (vs. evented crawler)
- Entity Framework 5: LINQ + raw SQL queries with EF + ADO.NET bulk insert
- EasyNetQ, RabbitMQ, CQS pattern
- StructureMap, HtmlAgilityPack, NLog
- Protobuf
CRAWLER WORKER PROCESS
- Start or resume (resume: load state from SQL, serialized)
- Get next page from queue (RabbitMQ, durable store)
- Download HTML (200 ms – 5 sec delay), HEAD request for external links
- Check statuses, canonical, redirects
- Run page analysers, extract data for report, prepare for bulk insert
- Find links:
  - Check for duplicates and blacklisted URLs
  - Check robots.txt
  - Check if visited (cache & db)
  - Normalize and store to queue (RabbitMQ)
- Save state every N pages (serialize with Protobuf, store byte[] to db)
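The worker loop above can be sketched roughly as follows. This is a simplified, in-memory illustration and not the actual seocrawler.co code: `CrawlState`, `CrawlerWorker`, `Normalize` and the two delegates are hypothetical names, an in-process `Queue` stands in for RabbitMQ, a `HashSet` stands in for the cache and db visited check, and the checkpoint callback stands in for Protobuf serialization to the database. Robots.txt, blacklists and page analysers are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical resumable state; the real service serializes its state and stores it in SQL.
public sealed class CrawlState
{
    public readonly Queue<string> Pending = new Queue<string>();     // stands in for the RabbitMQ queue
    public readonly HashSet<string> Visited = new HashSet<string>(); // stands in for cache + db lookup
    public int PagesProcessed;

    // Fresh start: seed the queue with one normalized URL.
    public static CrawlState StartAt(string seedUrl)
    {
        var state = new CrawlState();
        string seed = CrawlerWorker.Normalize(seedUrl);
        state.Visited.Add(seed);
        state.Pending.Enqueue(seed);
        return state;
    }
}

public sealed class CrawlerWorker
{
    private readonly Func<string, IEnumerable<string>> _fetchLinks; // download page, return its links
    private readonly Action<CrawlState> _saveState;                 // checkpoint callback
    private readonly int _saveEvery;

    public CrawlerWorker(Func<string, IEnumerable<string>> fetchLinks,
                         Action<CrawlState> saveState, int saveEvery)
    {
        _fetchLinks = fetchLinks;
        _saveState = saveState;
        _saveEvery = saveEvery;
    }

    // Minimal normalization: canonical scheme and host, fragment dropped.
    public static string Normalize(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Query);
    }

    // Works for both start and resume: resume simply passes in a previously saved state.
    public void Run(CrawlState state)
    {
        while (state.Pending.Count > 0)
        {
            string url = state.Pending.Dequeue();
            foreach (string link in _fetchLinks(url))
            {
                string normalized = Normalize(link);
                // HashSet.Add returns false for already seen URLs, so each page is queued once.
                if (state.Visited.Add(normalized))
                    state.Pending.Enqueue(normalized);
            }
            state.PagesProcessed++;
            if (state.PagesProcessed % _saveEvery == 0)
                _saveState(state); // periodic checkpoint so a crashed worker can resume
        }
        _saveState(state); // final checkpoint when the queue is drained
    }
}
```

Keeping download and persistence behind delegates means the loop can be exercised in a unit test with a fake link graph, without network or database access.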
RABBITMQ + EASYNETQ
SERVICE (subscriber)
rabbitBus.Subscribe<RecreateReportMessage>("crawlerservice", message =>
{
    // Translate the incoming message into an in-process command.
    _commandBus.Execute(new MakeReportCommand(message.ProjectId));
});
ADMIN UI (publisher)
rabbitBus.OpenChannel(c => c.Publish(new RecreateReportMessage(id)));
COMMAND BUS (MEDIATOR)
- Encapsulate command / query into classes
- IoC / DI for finding and matching handler with command/query types
- Easy unit testing
- AOP: intercept query or command, pre/post execution (logging, auth, caching, ...)
bool alreadyVisited = _bus.Request<bool>(new VisitedPageQuery.Input(projectId, urlHash));
_bus.Execute(new SavePageCommand(pageData, webPage));
public class SavePageReportHandler : IHandle<SavePageCommand>
{
    // implementation
}
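A minimal sketch of how such a bus might match commands to handlers and wrap execution with pre/post hooks. This is an assumption-laden illustration, not the talk's implementation: a dictionary replaces the IoC container lookup, the log callback stands in for AOP interception, and `CommandBus`, `Register`, `SaveUrlCommand` and `SaveUrlHandler` are hypothetical names. Only `IHandle<T>` and `Execute` mirror the snippets above, and the `Handle` method signature is assumed.

```csharp
using System;
using System.Collections.Generic;

// Same shape as the IHandle<T> used above; the Handle method name is an assumption.
public interface IHandle<TCommand>
{
    void Handle(TCommand command);
}

// Hypothetical minimal mediator: a dictionary takes the place of the IoC container.
public sealed class CommandBus
{
    private readonly Dictionary<Type, Action<object>> _handlers =
        new Dictionary<Type, Action<object>>();
    private readonly Action<string> _log;

    public CommandBus(Action<string> log)
    {
        _log = log;
    }

    public void Register<TCommand>(IHandle<TCommand> handler)
    {
        _handlers[typeof(TCommand)] = c => handler.Handle((TCommand)c);
    }

    public void Execute<TCommand>(TCommand command)
    {
        Action<object> handler;
        if (!_handlers.TryGetValue(typeof(TCommand), out handler))
            throw new InvalidOperationException("No handler for " + typeof(TCommand).Name);

        // Pre/post hooks: the place where logging, auth or caching could be intercepted.
        _log("before " + typeof(TCommand).Name);
        handler(command);
        _log("after " + typeof(TCommand).Name);
    }
}

// Illustrative command and handler, much simpler than the real SavePageCommand.
public sealed class SaveUrlCommand
{
    public readonly string Url;
    public SaveUrlCommand(string url) { Url = url; }
}

public sealed class SaveUrlHandler : IHandle<SaveUrlCommand>
{
    public readonly List<string> Saved = new List<string>();
    public void Handle(SaveUrlCommand command) { Saved.Add(command.Url); }
}
```

Because callers only see the bus and the command class, a handler can be replaced with a fake in unit tests, which is the "easy unit testing" point from the slide.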
ISSUES
- Everything will crash: net connection, db, thread, VM, ...
- Resuming / saving states
- Memory issues/leaks with some frameworks
- Don’t optimize before profiling (memory, db)
- Log everything
- DB indexes: how to store for fast filtering, paging
- DB as queueing system (don’t)
- CQS: command / query separation
- Broken HTML, crazy links
- Cloud services: connections fail
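One common way to cope with "connections fail" and "log everything" is to wrap flaky calls in a retry with exponential backoff. The slides do not say how the service handles this, so the sketch below is only one possible approach; the `Retry` helper and its parameters are hypothetical.

```csharp
using System;
using System.Threading;

// Hypothetical helper: retries a failing operation, doubling the delay after each attempt.
public static class Retry
{
    public static T Execute<T>(Func<T> operation, int maxAttempts,
                               TimeSpan baseDelay, Action<string> log)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex)
            {
                // Out of attempts: rethrow and let the caller decide what to do.
                if (attempt >= maxAttempts)
                    throw;

                // Delay grows as baseDelay * 2^(attempt - 1).
                TimeSpan delay = TimeSpan.FromMilliseconds(
                    baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));
                log("attempt " + attempt + " failed: " + ex.Message +
                    "; retrying in " + delay.TotalMilliseconds + " ms");
                Thread.Sleep(delay);
            }
        }
    }
}
```

A call such as `Retry.Execute(() => LoadPage(url), 3, TimeSpan.FromSeconds(1), Console.WriteLine)` would try up to three times, where `LoadPage` is a placeholder for any network or database operation.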
LEARNED
ORM
- Go low level (raw SQL, bulk insert, SP) if needed
- Profile: memory, SQL queries
- Watch for 1st level cache (ORM unit of work or session)
- NoSQL?
Caching
- In process, in memory
- Plan moving to a separate service (Redis, ...)
SOA
- Pipeline design
- Pub/Sub, CQS pattern (Mediator)
- Unit testing
- Cloud resilience
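For the caching point, one way to keep the later move to a separate service cheap is to hide the in-process cache behind a small interface. The sketch below assumes a visited-URL cache keyed by project id and URL hash, loosely modelled on the `VisitedPageQuery` inputs shown earlier; `IVisitedCache` and `InMemoryVisitedCache` are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical abstraction: a Redis-backed class could implement the same interface later.
public interface IVisitedCache
{
    // Returns true the first time a URL hash is seen for a project, false afterwards.
    bool TryMarkVisited(long projectId, string urlHash);
}

// In-process, in-memory implementation; safe to share between crawler threads.
public sealed class InMemoryVisitedCache : IVisitedCache
{
    private readonly ConcurrentDictionary<string, bool> _seen =
        new ConcurrentDictionary<string, bool>();

    public bool TryMarkVisited(long projectId, string urlHash)
    {
        // TryAdd is atomic, so two threads cannot both claim the same URL.
        return _seen.TryAdd(projectId + ":" + urlHash, true);
    }
}
```

Combining the check and the mark in one atomic call avoids the race where two workers both see "not visited" and crawl the same page.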
HOSTING
Hosting:
- All on one server for now
- Started with EC2
- Migrated to Azure VM (higher HDD IO, faster CPU), BizSpark (free VM), free inbound traffic!
- Now on Hetzner (dedicated, i7, 32 GB RAM, 2x SSD, Win 2012 = 60 €/month)
Stack: Win 2012, SQL Server 2012, .NET 4.5, ASP.NET MVC 4
Load & stress testing (crawl 500k URLs)
Goal: 100 parallel crawlers on VM 2CPU 4GB RAM (OS, DB)
Will scale when needed
FUTURE PLANS
- Fancy reports
- Brand new web user interface
- Integration with third-party services (MajesticSEO, ...)
- Special page analysis
- NoSQL (RavenDb or Redis) for caching
- Warehouse Db for browsing crawled pages
- Lucene for full text search (RavenDb)
- Refactor crawler, pipeline design, async evented design
THANK YOU! QUESTIONS?
Hrvoje Hudoletnjak, m: [email protected], t: twitter.com/hhrvoje
Goran Čandrlić, m: [email protected], t: twitter.com/chande