Large-scale processing with Django
TRANSCRIPT
Large-scale processing
using Django
Mashing clouds, queues & workflows
PyWeb-IL 8th meeting
Udi h Bauman (@dibau_naum_h)
Tikal Knowledge (http://tikalk.com)
Agenda
Web apps vs. Back-end Services
Addressing Scalability
Experience with Django
Use-case 1: automated data integration service
Use-case 2: social media analysis service
Recommendations
Links
Web apps vs. Back-end Services
The common conception is that a Web framework is just for Web sites
Web back-ends become thinner - just services
Applications become service providers, usually over HTTP
All of these are reasons to use Django for almost any back-end offering services
Web apps vs. Back-end Services
How are back-end services different?
Usually have behaviors not triggered by client requests
Usually involve long processing
May involve continuous communications, & not just request-response
Reliability & high-availability are usually more important with non-human users
Lots of communication with other back-ends
Addressing the needs
of back-end services
Message Queues abstract invocation & enable reliable distributed processing
Workflow Engines manage long processing
Continuous communication (e.g., TCP-based) is possible, can be abstracted with XMPP
Clouds & auto-scaling enable high-availability
Can use SOAP/REST for protocols against other back-ends
Experience with Django
No matter how heavy & large the task & load were, it just worked.
Even when processing took days to complete, Django was 100% robust
Had no issues with:
Performance
Large data
Protocols against other back-ends
Use-case 1: automated data integration service
Back-end service for processing large data arriving from different sources
Integrating data & services across several back-end systems
Serving as common repository of content & metadata
All processes are automated, but expose UI dashboards & reports for manual control
Use-case 1: protocols
SOAP:
Some other back-ends talk SOAP
Used a great library called Suds
Works really well
Simple API, very easy to introspect
Used large batches & long conversations
Only issue is the stubs cache, which is not updated when the WSDL changes (until you manually clear it or restart)
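A minimal sketch of the Suds usage described above. The WSDL URL is a placeholder; printing the client is the introspection mentioned on the slide, and passing a `NoCache` instance is one way to sidestep the stale-stubs issue.

```python
# Sketch of creating & introspecting a Suds SOAP client (WSDL URL is a
# placeholder; suds is a third-party library).
def make_soap_client(wsdl_url):
    from suds.client import Client
    from suds.cache import NoCache

    # Disabling the cache avoids stale stubs when the WSDL changes,
    # at the cost of re-fetching it on every client creation.
    client = Client(wsdl_url, cache=NoCache())
    print(client)  # introspection: lists service ports & methods
    return client

# Example call (not run here -- requires a live SOAP endpoint):
# client = make_soap_client("http://example.com/service?wsdl")
# result = client.service.SomeOperation(arg1, arg2)
```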
Use case 1: protocols
Message queues:
Very elegant & useful for async protocols with other back-end services
Used REST interface to push & pull messages with message queues, such as ActiveMQ
Used Celery for AMQP-based message queues
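A sketch of the REST-based push & pull against a queue, in the style described above. The broker URL and queue name are placeholders; ActiveMQ's web console exposes queues under `/api/message` (port 8161 by default), where a POST enqueues a message and a GET consumes one.

```python
# Pushing & pulling messages over ActiveMQ's REST interface (stdlib only;
# host, port & queue name are placeholders).
import urllib.parse
import urllib.request

BROKER = "http://localhost:8161/api/message/demo.queue?type=queue"

def push(body):
    # POST enqueues one message on the queue.
    data = urllib.parse.urlencode({"body": body}).encode()
    return urllib.request.urlopen(BROKER, data=data)

def pull():
    # GET against the same URL consumes one message.
    with urllib.request.urlopen(BROKER) as resp:
        return resp.read().decode()
```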
Use-case 1: processing
Data files:
Processing started with upload of large archives of large data files
According to metadata, different format handlers were invoked
Python libraries worked well:
SAX processing for large XMLs
CSV for large flat files
Be careful with memory
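Both formats mentioned above can be streamed with the standard library, which is how memory stays flat regardless of file size:

```python
# Streaming XML with SAX & flat files with csv -- rows/elements are
# handled one at a time instead of loading the whole file.
import csv
import io
import xml.sax

class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements without building a tree in memory."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1

handler = ItemCounter()
xml.sax.parseString(b"<feed><item/><item/><item/></feed>", handler)
print(handler.count)  # 3, parsed as a stream

# csv.reader is likewise an iterator: rows are yielded one by one.
for name, value in csv.reader(io.StringIO("a,1\nb,2\n")):
    pass  # process each row, then let it go
```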
Use-case 1: ETL
Eventually externalized some of the ETL processing to an external graphical tool:
Not because of any problem with the Django-based solution, which was fast & easy to manage
Mainly in order to simplify the architecture
Used an open-source ETL tool called Talend:
Graphical interface
Exports logic to Java-based scripts
Use-case 1: workflow
Integration processes are lengthy & full of business logic, constantly evolving
Used Nicolas Toll's workflow engine, which allows users to define & manage complex workflows
Modified & extended the engine to:
Define different logics of action invocation
Add a graphical dashboard
Use-case 1: queues
Processes can't be done using synchronous calls, if only because you'll eventually reach the max recursion depth
Used Celery over RabbitMQ:
Very simple Django integration
Used task-names for flexible handlers invocation
Used periodic tasks for driving the workflow engine
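The "task names for flexible handler invocation" pattern can be sketched without a broker. With Celery the equivalent is sending a task by its registered name (e.g. `app.send_task("ingest.csv", args=[...])`), so the producer never imports the handler code; the registry below is a pure-Python stand-in with hypothetical task names.

```python
# Dispatching handlers by task name -- a stand-in for Celery's named
# tasks; names & handlers here are illustrative.
HANDLERS = {}

def task(name):
    """Register a handler under a task name (mimics Celery's decorator)."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@task("ingest.csv")
def handle_csv(path):
    return f"csv:{path}"

@task("ingest.xml")
def handle_xml(path):
    return f"xml:{path}"

def dispatch(task_name, *args):
    # Metadata picks the handler -- no if/elif chains in the workflow.
    return HANDLERS[task_name](*args)

print(dispatch("ingest.csv", "data/a.csv"))  # csv:data/a.csv
```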
Use-case 1: cloud
Heavily used Amazon EC2 & S3 services
Horizontal & vertical scaling
Reliable & easy to manage
Message queues allow distributing load horizontally
Used script-based auto-scaling starting new instances based on load
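The scaling decision in such a script can be as small as the function below. The numbers are illustrative assumptions; the actual EC2 launch (e.g. via boto3's `run_instances`) is only indicated in a comment.

```python
# How many instances the current queue depth calls for, clamped to a
# floor & ceiling. All thresholds are illustrative.
import math

def instances_needed(queue_depth, tasks_per_instance=100,
                     min_instances=1, max_instances=10):
    """Scale worker count with the backlog, within a sane range."""
    wanted = math.ceil(queue_depth / tasks_per_instance)
    return max(min_instances, min(max_instances, wanted))

# A cron-style loop would compare this with the running count and then
# start instances, e.g. with boto3 (placeholder AMI id):
#   ec2.run_instances(ImageId="ami-...", MinCount=n, MaxCount=n)
print(instances_needed(0))     # 1  (never below the floor)
print(instances_needed(350))   # 4
print(instances_needed(5000))  # 10 (capped)
```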
Use-case 1: dashboard & reports
Used customized admin for application UI:
Side menu
Template tags for non-editable associated data in forms (due to large data lists)
Used simple home-grown process dashboard
Used Google Visualization for charts:
Charts API generates any chart as an image, using just a URL
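"Just a URL" means the whole chart is encoded in query parameters, so an `<img src=...>` is all the dashboard needs. A sketch against the (since retired) Google Image Charts API, with illustrative data:

```python
# Building a line chart purely as a URL (Google Image Charts parameters:
# cht = chart type, chs = size, chd = data, chtt = title).
from urllib.parse import urlencode

def line_chart_url(values, size="400x200", title="Throughput"):
    params = {
        "cht": "lc",                                    # line chart
        "chs": size,                                    # width x height
        "chd": "t:" + ",".join(str(v) for v in values), # text-format data
        "chtt": title,
    }
    return "https://chart.googleapis.com/chart?" + urlencode(params)

url = line_chart_url([10, 40, 30, 80])
print(url)
```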
Use case 2: social media
analysis service
Service for processing large streams of social media & user-generated content (e.g., Twitter)
Social media is processed & analyzed to create value for end-users, e.g.:
Generating daily summary of thousands of social media messages (+ referenced content), according to user's interests
Recommending people to follow based on interests
Use case 2: architecture
Due to the large amount of data we need to process, a distributed self-organizing architecture was chosen:
Data entities are represented by objects with behavior
Objects are organized in hierarchical layers
Objects have autonomous micro behavior aggregating to the macro behavior of the system
Layers are organized in spatial grids, which enable easy sharding & parallel processing
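A hypothetical sketch of the spatial-grid idea: objects are bucketed into grid cells by their coordinates, and each cell maps to a shard, so neighbouring objects land together and cells can be processed in parallel. Function names, cell size & shard count are all assumptions.

```python
# Bucketing objects into grid cells & mapping cells to shards
# (illustrative sizes; nothing here comes from the original system).
def grid_cell(x, y, cell_size=100):
    """Map a coordinate pair to integer grid-cell indices."""
    return (int(x // cell_size), int(y // cell_size))

def shard_for(x, y, num_shards=4, cell_size=100):
    # All objects in one cell share a shard, so each cell's
    # micro-behaviour can run locally & in parallel.
    return hash(grid_cell(x, y, cell_size)) % num_shards

print(grid_cell(250, 40))                        # (2, 0)
print(shard_for(250, 40) == shard_for(260, 45))  # same cell => same shard
```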
Use case 2: infrastructure
Several frameworks are used for analysis services:
NLTK
Dbpedia
ConceptNet
&c
The tools are separated into a different project, to enable distribution
Use case 2: Queues
Tool invocations are asynchronous, & therefore done via message queues
Celery & RabbitMQ are used
JSON is used as message payload
Use case 2: combining clouds
The data processing divides into 2 types:
On-demand:
Continuous, always-on
Most of the data processing
Very intensive
Uses pure Python business logic
Asynchronous processing:
Can be queued
Not always-on
Requires 3rd-party libraries, not limited to Python
Use case 2: combining clouds
It therefore made sense to separate the deployment to 2 cloud computing vendors:
Google AppEngine used for on-demand processing:
Cost-effective for always-on intensive computing
Easy auto-scaling
Amazon EC2 used for asynchronous processing:
Supports any 3rd party library
Can be started just upon need
Use case 2: inter-cloud communication
To connect the 2 back-ends running on different clouds, we've used a combination of:
XMPP: Instant Messaging protocol, enabling reliable network-agnostic synchronous communication
Django-xmpp is a simple framework on the Amazon side
Google AppEngine provides native support for XMPP
Message queues: Tool invocations on the Amazon side are queued in RabbitMQ/Celery
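A sketch of the inter-cloud hand-off: a tool invocation is serialized as JSON and carried over XMPP between the clouds. The message shape and JID are assumptions; `google.appengine.api.xmpp` is App Engine's (legacy-runtime) native XMPP support, shown only in an uncalled function.

```python
# Encoding/decoding a tool invocation as a JSON XMPP payload (message
# shape is illustrative, not the original system's format).
import json

def encode_invocation(tool, args):
    """JSON body for a tool invocation, queued on the EC2 side."""
    return json.dumps({"tool": tool, "args": args})

def decode_invocation(body):
    msg = json.loads(body)
    return msg["tool"], msg["args"]

def send_invocation(jid, tool, args):
    # App Engine side (legacy runtimes only; not run here):
    from google.appengine.api import xmpp
    xmpp.send_message(jid, encode_invocation(tool, args))

tool, args = decode_invocation(encode_invocation("nltk.tokenize", ["hello world"]))
print(tool, args)  # nltk.tokenize ['hello world']
```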
Future?
Erlang integration seems promising in the implementation of large scale services
Frameworks such as Fuzed can be integrated with Python/Django
We're working on it as a coding session & hope to deliver a prototype soon
Links
Celery
Suds
Workflow
django-xmpp
Fuzed
Google Chart API
Talend
Thanks!
@dibau_naum_h