linked data infrastructure soma - the insight centre for data … · 2016-02-23 · distributed...
TRANSCRIPT
Soma: Linked Data Infrastructure
What is Soma?
It’s Big Data Candy for the Cloud.
The Soma platform helps data scientists collaborate to discover and share new facts from large datasets hosted on shared infrastructure.
All this while lowering development and operations costs.
Meet our Customers
Expert: See themselves as “experts” or an authority on a subject. Want the big picture and like easy-to-use specialised applications with great visualisation.
Creative: People who see themselves as data “artists”. Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or the data narrative.
Engineer: See themselves as “engineers”. Focused on the technical problem of managing data: how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.
Researcher: See themselves as “scientists”. People with a deep academic background in maths, machine learning & modeling complex processes. Reluctant coders.
Customers we support now
Creative: Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or the data narrative.
Engineer: Focused on the technical problem of managing data. Normally strong software developers.
Researcher: People with a deep academic background in science, maths, machine learning. Reluctant coders.
What we deliver to customers
Creative
Now:
● Gitlab integration
● from gitlab
● Web facing applications
Researcher
Now:
● Discovery early adopters
Early September:
● Discovery platform rollout
Engineer
Now:
● Big Data Cluster
● Container Management
November:
● Storage frameworks
Fully operational big data station
Right Now
Mesos based Cloud O/S
● Cluster of 88 CPUs, 295 GB of memory
● Distributed Application Scheduling
● Resource Scheduling
Container Management
DNS service discovery
Features
Deployment
● Gitlab
● Mesos Cluster
● Zookeeper Cluster
● HDFS Cluster
● Integrated DNS
● CI servers
● Docker Registry
Gitlab
● All applications MUST be in gitlab
Mesos Cluster and Container Manager
● Let’s have a look at what is running right now:
Deeper Dive
“can mix both batch and real-time processing”
“process at batch and real-time Velocity”
Lambda architecture
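As a rough illustration of the “mix both batch and real-time processing” idea, the sketch below shows the three roles of a Lambda architecture in miniature: a batch view recomputed from the full dataset, a speed view covering recent events, and a query that merges the two. The function names and the sample events are hypothetical and are not taken from Soma itself.

```python
from collections import Counter
from typing import Iterable


def batch_view(master_dataset: Iterable[str]) -> Counter:
    """Batch layer: recompute counts from the full historical dataset."""
    return Counter(master_dataset)


def speed_view(recent_events: Iterable[str]) -> Counter:
    """Speed layer: count only events not yet absorbed by a batch run."""
    return Counter(recent_events)


def query(key: str, batch: Counter, speed: Counter) -> int:
    """Serving layer: merge the batch and real-time views at query time."""
    return batch[key] + speed[key]


# Hypothetical events: names of datasets being accessed.
historical = ["dbpedia", "dogfood", "dbpedia"]
recent = ["dbpedia"]

batch = batch_view(historical)
speed = speed_view(recent)
print(query("dbpedia", batch, speed))  # 3
```

In a real deployment the batch view would come from a periodic cluster job and the speed view from a stream processor; here both are plain in-memory counters so that the merge step is easy to see.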
Data sources
Source Control Management
Continuous Deployment
Service Monitoring
Always available key datasets
● DBPedia
● SemanticWeb Dogfood
Features
1. Have a gitlab account
2. Ask Research Ops to add the Soma Role to your project
3. If you are accepted you will be guided through “dockerizing” your gitlab project
4. Once accepted, every push to your master branch will be deployed and accessible online through Soma.
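The deployment rule in the steps above (only accepted projects, only pushes to master) can be sketched as a small check on an incoming push event. This is an illustration only: the `ref` and `project.path_with_namespace` fields are assumed to follow the general shape of a GitLab push webhook payload, and `should_deploy` and the project names are hypothetical, not part of Soma.

```python
from typing import Set


def should_deploy(push_event: dict, soma_projects: Set[str]) -> bool:
    """Return True when a push should trigger a deployment.

    A push qualifies only if it targets the master branch and the
    project has been given the Soma Role (modelled here as a set).
    """
    ref = push_event.get("ref", "")
    project = push_event.get("project", {}).get("path_with_namespace", "")
    return ref == "refs/heads/master" and project in soma_projects


# Hypothetical projects that have been accepted onto the platform.
accepted = {"insight/demo-app"}

event = {
    "ref": "refs/heads/master",
    "project": {"path_with_namespace": "insight/demo-app"},
}
print(should_deploy(event, accepted))  # True
```

Pushes to other branches, or from projects without the Soma Role, return False and would simply be ignored.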
Continuous Deployment
Integrated Discovery platform
SOMA Discover - hosted discovery tool based on the Smarter Data project, allowing exploration of data and sharing of results.
Other internal tools such as Sig.ma and Social Lens, with other projects to follow.
Features
Goals for Research Ops
Nurture a Data Engineering community at Insight with supportive experts, shared tools & best practices
Provide a Shared analytics platform for Data Scientists at Insight (Soma)
Encourage new research and engagements with the wider big data analytics research community
Nurture
● Provide a structured approach to managing and releasing all Engineering IP (Code and Data) at Insight
○ Source control (Git)
○ Release management
○ Assist in IP management
● Provide Quality Circles for Engineering practices
○ 2 Groups - Data Visualisation & Big Data. Workshops to commence this month.
Provide
● Build big data infrastructure for Insight
○ Soma platform
● Support Hadoop ongoing development
○ Hadoop clusters, Dataspace support
● Support Ad Hoc projects requiring scale
○ Cancer atlas
● Provide “Big Data” expertise to the Linked Data group
○ Hadoop, Yarn, Mesos, Spark, Dataspace, Mongo and Virtuoso
Problems being met
● High cost in research when data scales to “Big Data” [P1]
○ Ad hoc maintenance of big data sets is expensive [P2]
○ Development complexity of valuable Big Data jobs is prohibitive [P3]
● The high cost of operating Big Data infrastructure [P4]
○ Scarcity of hardware and lack of funds for new hardware [P5]
○ Inability to maintain a core operations team [P7]
● Missed opportunities for researchers to collaborate [P6]
Soma serving our customers
Soma Create - Serves data fresh from the source. Has queryable large datasets that are both highly available & up-to-date. Has a service to mash these up.
Soma Engineer - Provides a Lambda architecture consuming, cleaning, processing and loading the data to the data layer.
Soma Discover - Useful blocks of processing that can be connected together using a nice GUI. Works with many datastores.
Soma Expert - Vertical applications solving a real world problem. These apps are built by Insight’s Data Researchers and Data Creatives.
The 4 kinds of Data Scientist
Expert: See themselves as “experts” or an authority on a subject. Want the big picture and like easy-to-use specialised applications with great visualisation.
Creative: People who see themselves as data “artists”. Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or the data narrative.
Engineer: See themselves as “engineers”. Focused on the technical problem of managing data: how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.
Researcher: See themselves as “scientists”. People with a deep academic background in maths, machine learning & modeling complex processes. Reluctant coders.
Goals
Soma to be a complete ecosystem to help researchers deliver “Big Data” distributed applications
Showcase Insight expertise
Standardize best practices for linked data at big data scales
Delivers targeted applications & tools
Tools to build complex analytics apps & job management
Distributed O/S (Better than cloud)
● We use Mesos based infrastructure to provide
○ Scheduling Process Execution of Jobs/Applications across the cluster
○ Resource scheduling of the needed CPU/Memory/Storage for these applications
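To make the idea of resource scheduling concrete, here is a toy first-fit placement of tasks onto nodes by free CPU and memory. It is a deliberately simplified sketch and not how Mesos itself allocates resources (Mesos hands resource offers to frameworks, which decide what to run); the `Node`, `Task` and `place` names and all the figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    cpus: float
    mem_gb: float


@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float


def place(tasks: List[Task], nodes: List[Node]) -> Dict[str, Optional[str]]:
    """Assign each task to the first node with enough free CPU and memory."""
    placement: Dict[str, Optional[str]] = {}
    for task in tasks:
        placement[task.name] = None  # stays None if no node can host the task
        for node in nodes:
            if node.cpus >= task.cpus and node.mem_gb >= task.mem_gb:
                # Reserve the resources on the chosen node.
                node.cpus -= task.cpus
                node.mem_gb -= task.mem_gb
                placement[task.name] = node.name
                break
    return placement


nodes = [Node("node-a", cpus=4, mem_gb=16), Node("node-b", cpus=8, mem_gb=32)]
tasks = [Task("web-app", 2, 4), Task("spark-job", 6, 24), Task("too-big", 16, 64)]
print(place(tasks, nodes))
```

Note that `place` mutates the nodes it is given, so each node's remaining capacity reflects what has already been scheduled onto it.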
SOMA Discover (Data)
Where we are now
What we have
Soma Engineer - Standard Mesos platform - Provides a Lambda architecture consuming, cleaning, processing and loading the data to the data layer.
Soma Discover - Smarter Data - an interactive expressive query tool that creates data blocks & visualisations
What we need help on
Soma Expert - Pivoty - a medical index built from standard HCLS datasets that uses a Pivot Browser
Soma Create - The Insight Standard Dataset - a shared queryable standard set of big-data sources