Large-scale processing with Django
TRANSCRIPT
Large-scale processing
using Django
Mashing clouds, queues & workflows
PyWeb-IL 8th meeting
Udi h Bauman (@dibau_naum_h)
Tikal Knowledge (http://tikalk.com)
Agenda
Web apps vs. Back-end Services
Addressing Scalability
Experience with Django
Use-case 1: automated data integration service
Use-case 2: social media analysis service
Recommendations
Links
Web apps vs. Back-end Services
The common conception is that a Web framework is just for Web sites
Web back-ends become thinner - just services
Applications become service providers, usually over HTTP
All of these are reasons to use Django for almost any back-end offering services
Web apps vs. Back-end Services
How are back-end services different?
Usually have behaviors not triggered by client requests
Usually involve long processing
May involve continuous communications, & not just request-response
Reliability & high-availability are usually more important with non-human users
Lots of communication with other back-ends
Addressing the needs
of back-end services
Message Queues abstract invocation & enable reliable distributed processing
Workflow Engines manage long processing
Continuous communication (e.g., TCP-based) is possible, can be abstracted with XMPP
Clouds & auto-scaling enable high-availability
Can use SOAP/REST for protocols against other back-ends
Experience with Django
No matter how heavy & large the task & load were, it just worked.
Even when processing took days to complete, Django was 100% robust
Had no issues with:
Performance
Large data
Protocols against other back-ends
Use-case 1: automated data integration service
Back-end service for processing large data arriving from different sources
Integrating data & services across several back-end systems
Serving as common repository of content & metadata
All processes are automated, but expose UI dashboards & reports for manual control
Use-case 1: protocols
SOAP:
Some other back-ends talk SOAP
Used a great library called Suds
Works really well
Simple API, very easy to introspect
Used large batches & long conversations
Only issue is the stubs cache, which is not updated when the WSDL changes (until you manually clear it or restart)
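A minimal sketch of the Suds usage described above. The WSDL URL is a placeholder; printing the client is the introspection mentioned on the slide, and passing a `NoCache` instance is one way to sidestep the stale-stubs issue.

```python
# Sketch of creating & introspecting a Suds SOAP client (WSDL URL is a
# placeholder; suds is a third-party library).
def make_soap_client(wsdl_url):
    from suds.client import Client
    from suds.cache import NoCache

    # Disabling the cache avoids stale stubs when the WSDL changes,
    # at the cost of re-fetching it on every client creation.
    client = Client(wsdl_url, cache=NoCache())
    print(client)  # introspection: lists service ports & methods
    return client

# Example call (not run here -- requires a live SOAP endpoint):
# client = make_soap_client("http://example.com/service?wsdl")
# result = client.service.SomeOperation(arg1, arg2)
```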
Use case 1: protocols
Message queues:
Very elegant & useful for async protocols with other back-end services
Used REST interface to push & pull messages with message queues, such as ActiveMQ
Used Celery for AMQP-based message queues
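A sketch of the REST-based push & pull against a queue, in the style described above. The broker URL and queue name are placeholders; ActiveMQ's web console exposes queues under `/api/message` (port 8161 by default), where a POST enqueues a message and a GET consumes one.

```python
# Pushing & pulling messages over ActiveMQ's REST interface (stdlib only;
# host, port & queue name are placeholders).
import urllib.parse
import urllib.request

BROKER = "http://localhost:8161/api/message/demo.queue?type=queue"

def push(body):
    # POST enqueues one message on the queue.
    data = urllib.parse.urlencode({"body": body}).encode()
    return urllib.request.urlopen(BROKER, data=data)

def pull():
    # GET against the same URL consumes one message.
    with urllib.request.urlopen(BROKER) as resp:
        return resp.read().decode()
```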
Use-case 1: processing
Data files:
Processing started with upload of large archives of large data files
According to metadata, different format handlers were invoked
Python libraries worked well:
SAX processing for large XMLs
CSV for large flat files
Be careful with memory
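Both formats mentioned above can be streamed with the standard library, which is how memory stays flat regardless of file size:

```python
# Streaming XML with SAX & flat files with csv -- rows/elements are
# handled one at a time instead of loading the whole file.
import csv
import io
import xml.sax

class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements without building a tree in memory."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1

handler = ItemCounter()
xml.sax.parseString(b"<feed><item/><item/><item/></feed>", handler)
print(handler.count)  # 3, parsed as a stream

# csv.reader is likewise an iterator: rows are yielded one by one.
for name, value in csv.reader(io.StringIO("a,1\nb,2\n")):
    pass  # process each row, then let it go
```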
Use-case 1: ETL
Eventually externalized some of the ETL processing to an external graphical tool:
Not because of any problem with the Django-based solution, which was fast & easy to manage
Mainly in order to simplify the architecture
Used an open-source ETL tool called Talend:
Graphical interface
Exports logic to Java-based scripts
Use-case 1: workflow
Integration processes are lengthy & full of business logic, constantly evolving
Used Nicolas Toll's workflow engine, which allows users to define & manage complex workflows
Modified & extended the engine to:
Define different logics of action invocation
Add a graphical dashboard
Use-case 1: queues
Processes can't be done using synchronous calls, if only because you'll eventually reach the max recursion depth
Used Celery over RabbitMQ:
Very simple Django integration
Used task-names for flexible handlers invocation
Used periodic tasks for driving the workflow engine
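The "task names for flexible handler invocation" pattern can be sketched without a broker. With Celery the equivalent is sending a task by its registered name (e.g. `app.send_task("ingest.csv", args=[...])`), so the producer never imports the handler code; the registry below is a pure-Python stand-in with hypothetical task names.

```python
# Dispatching handlers by task name -- a stand-in for Celery's named
# tasks; names & handlers here are illustrative.
HANDLERS = {}

def task(name):
    """Register a handler under a task name (mimics Celery's decorator)."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@task("ingest.csv")
def handle_csv(path):
    return f"csv:{path}"

@task("ingest.xml")
def handle_xml(path):
    return f"xml:{path}"

def dispatch(task_name, *args):
    # Metadata picks the handler -- no if/elif chains in the workflow.
    return HANDLERS[task_name](*args)

print(dispatch("ingest.csv", "data/a.csv"))  # csv:data/a.csv
```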
Use-case 1: cloud
Heavily used Amazon EC2 & S3 services
Horizontal & vertical scaling
Reliable & easy to manage
Message queues allow distributing load horizontally
Used script-based auto-scaling starting new instances based on load
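The scaling decision in such a script can be as small as the function below. The numbers are illustrative assumptions; the actual EC2 launch (e.g. via boto3's `run_instances`) is only indicated in a comment.

```python
# How many instances the current queue depth calls for, clamped to a
# floor & ceiling. All thresholds are illustrative.
import math

def instances_needed(queue_depth, tasks_per_instance=100,
                     min_instances=1, max_instances=10):
    """Scale worker count with the backlog, within a sane range."""
    wanted = math.ceil(queue_depth / tasks_per_instance)
    return max(min_instances, min(max_instances, wanted))

# A cron-style loop would compare this with the running count and then
# start instances, e.g. with boto3 (placeholder AMI id):
#   ec2.run_instances(ImageId="ami-...", MinCount=n, MaxCount=n)
print(instances_needed(0))     # 1  (never below the floor)
print(instances_needed(350))   # 4
print(instances_needed(5000))  # 10 (capped)
```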
Use-case 1: dashboard & reports
Used customized admin for application UI:
Side menu
Template tags for non-editable associated data in forms (due to large data lists)
Used simple home-grown process dashboard
Used Google Visualization for charts:
Charts API generates any chart as an image, using just a URL
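"Just a URL" means the whole chart is encoded in query parameters, so an `<img src=...>` is all the dashboard needs. A sketch against the (since retired) Google Image Charts API, with illustrative data:

```python
# Building a line chart purely as a URL (Google Image Charts parameters:
# cht = chart type, chs = size, chd = data, chtt = title).
from urllib.parse import urlencode

def line_chart_url(values, size="400x200", title="Throughput"):
    params = {
        "cht": "lc",                                    # line chart
        "chs": size,                                    # width x height
        "chd": "t:" + ",".join(str(v) for v in values), # text-format data
        "chtt": title,
    }
    return "https://chart.googleapis.com/chart?" + urlencode(params)

url = line_chart_url([10, 40, 30, 80])
print(url)
```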
Use case 2: social media
analysis service
Service for processing large streams of social media & user-generated content (e.g., Twitter)
Social media is processed & analyzed to create value for end-users, e.g.:
Generating daily summary of thousands of social media messages (+ referenced content), according to user's interests
Recommending people to follow based on interests
Use case 2: architecture
Due to the large amount of data we need to process, a distributed self-organizing architecture was chosen:
Data entities are represented by objects with behavior
Objects are organized in hierarchical layers
Objects have autonomous micro behavior aggregating to the macro behavior of the system
Layers are organized in spatial grids, which enable easy sharding & parallel processing
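A hypothetical sketch of the spatial-grid idea: objects are bucketed into grid cells by their coordinates, and each cell maps to a shard, so neighbouring objects land together and cells can be processed in parallel. Function names, cell size & shard count are all assumptions.

```python
# Bucketing objects into grid cells & mapping cells to shards
# (illustrative sizes; nothing here comes from the original system).
def grid_cell(x, y, cell_size=100):
    """Map a coordinate pair to integer grid-cell indices."""
    return (int(x // cell_size), int(y // cell_size))

def shard_for(x, y, num_shards=4, cell_size=100):
    # All objects in one cell share a shard, so each cell's
    # micro-behaviour can run locally & in parallel.
    return hash(grid_cell(x, y, cell_size)) % num_shards

print(grid_cell(250, 40))                        # (2, 0)
print(shard_for(250, 40) == shard_for(260, 45))  # same cell => same shard
```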
Use case 2: infrastructure
Several frameworks are used for analysis services:
NLTK
Dbpedia
ConceptNet
&c
The tools are separated into a different project, to enable distribution
Use case 2: Queues
Tool invocations are asynchronous, & therefore done via message queues
Celery & RabbitMQ are used
JSON is used as message payload
Use case 2: combining clouds
The data processing divides into 2 types:
On-demand:
Continuous, always-on
Most of the data processing
Very intensive
Uses pure Python business logic
Asynchronous processing:
Can be queued
Not always-on
Requires 3rd-party libraries, not limited to Python
Use case 2: combining clouds
It therefore made sense to separate the deployment to 2 cloud computing vendors:
Google AppEngine used for on-demand processing:
Cost-effective for always-on intensive computing
Easy auto-scaling
Amazon EC2 used for asynchronous processing:
Supports any 3rd party library
Can be started just upon need
Use case 2: inter-cloud communication
To connect the 2 back-ends running on different clouds, we've used a combination of:
XMPP: Instant Messaging protocol, enabling reliable network-agnostic synchronous communication
Django-xmpp is a simple framework on the Amazon side
Google AppEngine provides native support for XMPP
Message queues: Tool invocations on the Amazon side are queued in RabbitMQ/Celery
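A sketch of the inter-cloud hand-off: a tool invocation is serialized as JSON and carried over XMPP between the clouds. The message shape and JID are assumptions; `google.appengine.api.xmpp` is App Engine's (legacy-runtime) native XMPP support, shown only in an uncalled function.

```python
# Encoding/decoding a tool invocation as a JSON XMPP payload (message
# shape is illustrative, not the original system's format).
import json

def encode_invocation(tool, args):
    """JSON body for a tool invocation, queued on the EC2 side."""
    return json.dumps({"tool": tool, "args": args})

def decode_invocation(body):
    msg = json.loads(body)
    return msg["tool"], msg["args"]

def send_invocation(jid, tool, args):
    # App Engine side (legacy runtimes only; not run here):
    from google.appengine.api import xmpp
    xmpp.send_message(jid, encode_invocation(tool, args))

tool, args = decode_invocation(encode_invocation("nltk.tokenize", ["hello world"]))
print(tool, args)  # nltk.tokenize ['hello world']
```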
Future?
Erlang integration seems promising in the implementation of large scale services
Frameworks such as Fuzed can be integrated with Python/Django
We're working on it as a coding session & hope to deliver a prototype soon
Links
Celery
Suds
Workflow
django-xmpp
Fuzed
Google Chart API
Talend
Thanks!
@dibau_naum_h