Prof. Dr. Taysir Hassan A. Soliman
Vice Dean for Graduate Studies & Research, Faculty of Computers & Information, Assiut University
Assiut University BioDialog PI
Nov. 16, 2016
Big Data Analytics for BioDiversity
Outline
• About Assiut
• Assiut University
• Faculty of Computers & Information
• Research Interests
• Biodiversity Informatics Previous Activities at Assiut University
• Visits and examples of Biodiversity in Egypt
• Big data research and biodiversity
Assiut
• Gamal Abdel Nasser, second President of Egypt
• Jalal al-Din al-Khudayri al-Suyuti, Egyptian religious scholar
• Hafez Ibrahim, poet
• Amin Mohsen, diplomat
• Mustafa Lutfi al-Manfaluti, writer and poet
• Pope Shenouda III of Alexandria, Pope of the Coptic Orthodox Church
Jalal El Din El Suyuti
A Few Pictures From Assiut City
The Dam
A Walk beside the Nile
Assiut University Entrance
The Nile
Assiut University Map
Assiut University
• Assiut University was established in October 1957 as the first university in Upper Egypt. Its aim is to prepare highly qualified graduates who combine specialized academic knowledge with practical training in the necessary skills.
Faculties & Institutes
• Faculties: 18
• Institutes: 2 (Sugar Industry, Oncology Institute)
• International students: Yemen, Malaysia, Kuwait, Iraq
http://www.aun.edu.eg/
Faculty of Computers & Information Assiut University
Lab Building Administrative Building
Established in 2001
Faculty of Computers & Information Assiut University (Staff)
• Information Systems (1 professor, 1 assistant professor, 4 teaching assistants, 7 demonstrators)
• Information Technology (1 professor, 3 assistant professors, 2 teaching assistants, 6 demonstrators)
• Computer Science (2 professors, 3 associate professors, 3 associate lecturers, 6 teaching assistants, 10 demonstrators)
• Multimedia Systems (1 associate professor)
Faculty of Computers & Information Assiut University (Facilities)
• Undergraduate labs: 9
• Lecture halls: 9
• Specialized labs: 5 (GIS, Multimedia, HP, Big Data, and Bioinformatics)
• Research labs: 5
Geographic Information System Labs
The GIS Lab consists of three modules:
• GIS Undergraduate Lab
• GIS Research Unit
• GIS Servers Unit
Geographic Information System Labs: Contents
• 20 computers (Dell OptiPlex 380; Intel Core 2 Duo, 2 GB RAM)
• 1 plotter (HP Designjet T1200) for printing geographic maps
• 1 data show (projector) with its display board
Asyut Medical & Public Services Application
Clinics Medical Centers pharmacies
Medical Labs
Ambulance
Public Services
Multimedia Lab
Multimedia Production Unit
Multimedia Production Unit (Voice Recording Unit)
Multimedia Research Unit
Bioinformatics Research Lab & Big Data Labs
Information Systems Dept. Research Directions
Big Data Analytics
BioDiversity Informatics
Database Management
Data Mining
Semantic Data Integration
Recommender Systems
Bioinformatics
GIS
Health Informatics
Computer Science Dept. Research Directions
Software Engineering
Distributed Computing
Computer Vision
Image Processing
High Performance Computing
Cloud Computing
Artificial Intelligence
Information Technology Dept. Research Directions
Ad Hoc Networks
Internet of Things
Mobile Computing
Vision and Robotics
Network Security
Cloud Computing
Broadcasting and media technologies
Biodiversity Informatics Previous Activities at Assiut University
BioDiversity Informatics Workshop at Faculty of Computers and Information, Assiut University
• Number of scientists: 34 (Faculty of Science, Computers and Information, Agriculture, EELU) and 17 (teaching assistants)
• Number of undergraduate students: 156
• Number of employees: 9
• A total of 216 attendees
BioDiversity Informatics Research Group
Prof. Dr. Taysir Hassan, Vice Dean for Graduate Studies & Research, Faculty of Computers & Information, Assiut University (PI)
Prof. Dr. Medhat Moreed, Vice Dean for Societal Services and Environmental Development, Faculty of Science, Assiut University
Prof. Dr. Adel AbuElmagd, Dean of the Faculty of Computers & Information, Assiut University
Prof. Dr. Ahmed Moharam Vice President of Fungi Research Institute Assiut University
Marwa Hussein Assistant Lecturer Information Systems Department Faculty of Computers and Information Assiut University
Majid Askar Assistant Lecturer Computer Science Department Faculty of Computers and Information Assiut University
Dr. Ahmed Taloba Assistant Professor, IS Department, FCI, Assiut University
Dr. Ahmed Albanhawy Assistant Professor, Botany Department Faculty of Science, Suez Canal
From AinShams Workshop Sept. 2016
Wady El-Hetan
Why Big Data ?
• We need big data to understand the distribution of biodiversity
• Once scientific data becomes essential, transparency will be a must (publications and accessibility) … Ecological data access
• Science-driven data.
• In global ecology, we go with problems that
Global Environmental Changes
• Habitat loss and species extinction
• Where will animals move to survive?
• Will human development prevent them from getting there?
Solution: conservation strategies are a crucial step toward minimizing biodiversity loss.
• Ocean acidification and land use
Global BioDiversity and Human Health
Fresh Water
Infectious Diseases
Air Quality
Agriculture
Role of Plants Pharmaceuticals
WHO Report
• Measuring traits of individual organisms (nitrogen concentrations)
• Species distribution dataset (flora, fauna, geographic associations with museum data)
Questions ???
• Is it a “Data-driven” or a “knowledge-driven” science ?
• What are examples of research questions we can solve by relating big data to biodiversity informatics?
• In which phases of the big data life cycle can we extract research questions for biodiversity informatics?
Example 1: Identify Biodiversity Hotspots
• It is widely acknowledged that biodiversity is much more than just the number of species in a region, and a conservation strategy cannot be based merely on the number of taxa present in an ecosystem.
• Therefore, the idea that strongly emerges is the need to reconsider conservation priorities and to move toward an interdisciplinary approach through the creation of science-policy partnerships.
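A small numeric illustration of why a species count alone is not enough: two regions can hold the same number of species yet differ sharply in how evenly individuals are spread among them. The sketch below uses the Shannon diversity index, which is an assumption brought in here for illustration (the slides do not name a specific measure). The region data and species labels are made up.

```python
import math

def richness(counts):
    """Number of species with at least one recorded individual."""
    return sum(1 for n in counts.values() if n > 0)

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over species proportions."""
    total = sum(counts.values())
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h

# Hypothetical occurrence counts per species in two regions.
region_a = {"sp1": 25, "sp2": 25, "sp3": 25, "sp4": 25}
region_b = {"sp1": 97, "sp2": 1, "sp3": 1, "sp4": 1}

for name, counts in (("A", region_a), ("B", region_b)):
    print(name, richness(counts), round(shannon_index(counts), 3))
```

Both regions report four species, but region B is dominated by a single species and scores a much lower index. A hotspot ranking based only on richness would treat the two as equal.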
Is it just point distributions? Have a HYPOTHESIS
Other Examples
IUCN Red Lists?
Other Examples
Biodiversity Data Characteristics
• Voluminous
• Incremental
• Complex
• Scalability
• Heterogeneity
• Has a taxonomy type
• Distribution: the Global Biodiversity Information Facility (GBIF) currently holds over 577 million occurrence records in the areas of climate change, human health, food and security, biofuels, ecosystem services.
• Genetic/ Genomic Information – environmental genomics, including metagenomics and metabarcoding
Heterogeneous Data Types
Technical & Non-technical Priority Areas for Biodiversity Informatics Research
Technical Priority Areas:
• Deep analysis … to improve data understanding;
• Optimized architectures for analytics of data-at-rest and data-in-motion;
• Mechanisms for managing privacy … to enable the vast amounts of data which are not open data (and never can be open data) to be part of the Data Value Chain;
• Advanced visualization and user experience;
• Data management engineering.
Non-technical Priority Areas:
• Skills development;
• Business models and ecosystems;
• Policy, regulation and standardization;
• Social perceptions.
Big Data Analytics Life Cycle
[Figure: data life cycle diagram with the stages Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, and Analyze, plus the labels "metadata" and "Publish"]
Scientific Data management
Scientist
Visualization
E-Bird
Big Data Analytics Life Cycle
How do I assure my data for quality?
How do I choose my algorithm ?
Which type of Architecture do I use?
iDigBio
Big Data Challenges for ML and EDA
• Format variation of the raw data
• Noisy and poor quality data
• Fast moving streaming data
• Trustworthiness of the data analysis
• Highly distributed input sources
• High dimensionality
• Scalability of algorithms
Part I: Machine Learning Approaches
• One example is the usage of Deep Learning
• Deep learning algorithms lead to abstract representations because more abstract representations are often constructed based on less abstract ones.
• An important advantage of more abstract representations is that they can be invariant to the local changes in the input data.
• Learning such invariant features is an ongoing major goal in pattern recognition
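The invariance point above can be made concrete with a toy example. The sketch below uses non-overlapping max pooling, one common mechanism in convolutional networks, chosen here as an assumption for illustration (the slides do not name a specific mechanism). The input values are made up.

```python
def max_pool_1d(signal, window):
    """Non-overlapping max pooling: keep the largest value in each window."""
    return [max(signal[i:i + window]) for i in range(0, len(signal), window)]

# A toy "detector response": one strong activation in an otherwise quiet signal.
original = [0, 0, 9, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 9, 0, 0, 0, 0]  # same feature, moved one step to the right

pooled_original = max_pool_1d(original, window=4)
pooled_shifted = max_pool_1d(shifted, window=4)

print(pooled_original, pooled_shifted)
```

The two inputs differ, but their pooled representations are identical. The invariance is local: it holds only while the shift stays inside one pooling window.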
Example
An image is composed of different sources of variation, such as light, object shapes, and object materials. The abstract representations provided by deep learning algorithms can separate the different sources of variation in the data.
Example of A DNN
Learning the parameters in a deep architecture is a difficult optimization task, such as learning the parameters in neural networks with many hidden layers.
• Google’s “word2vec” tool is a technique for automated extraction of semantic representations from Big Data.
• This tool takes a large-scale text corpus as input and produces the word vectors as output.
Deep Learning
• Extracting complex patterns from massive volumes of data,
• Semantic indexing,
• Data tagging,
• Fast information retrieval
Deep Learning in Biodiversity Distribution (Wildlife Monitoring)
• Affordable and effective measures of conservation outcomes.
• Improve the quality of conservation monitoring and to scale monitoring programs to meet the global need.
• Extract meaningful information from the torrent of new sensor data, and improve the adaptive management of natural systems.
Case Studies
• Monitoring invasive species
• Detecting rare species
• Monitoring population through time
Empower biologists to analyze petabytes of sensor data from a network of remote microphones and cameras.
This system, which is being used to monitor endangered species and ecosystems around the globe, has enabled an order of magnitude improvement in the cost effectiveness of such projects.
This approach can be expanded to encompass a greater variety of sensor sources, such as drones, to monitor animal populations, habitat quality, and to actively deter wildlife from hazardous structures.
Detecting Bird Vocalization
Detecting Fish Underwater
Part II: The HOW-TO … Practice
Using Spark for BioDiversity Data
• Processing snapshots of biodiversity data providers’ entire datasets locally is an important capability.
• It allows broad questions to be asked across multiple data providers without needing to wait for providers to develop integrations or interfaces with each other;
• the providers’ web interfaces and application programming interfaces (APIs) no longer limit the way data is presented
• data can be processed at a much higher rate locally instead of through APIs.
Spark
• In 2014, Spark became an Apache Software Foundation top-level project, and its popularity as a big data processing engine has taken off.
• This implementation of the map-reduce pattern of data processing is much simpler to install and use than its industry-favorite predecessor, Hadoop.
• With Spark, arbitrary querying, joining, and reducing operations on and between entire biodiversity datasets can be done with very little code on a desktop computer or commonly available cloud computing resources.
• Machine Learning Library (MLlib)
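The map-reduce pattern named on this slide can be sketched in plain Python, so that it runs without Spark installed. This is an illustration of the pattern only, not Spark code. The records and field names are hypothetical. In PySpark, the map and reduce steps would roughly correspond to `map` and `reduceByKey` on an RDD, with the grouping step handled by the framework.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical occurrence records (field names are illustrative only).
records = [
    {"scientific_name": "Panthera leo", "country": "KE"},
    {"scientific_name": "Panthera leo", "country": "TZ"},
    {"scientific_name": "Acacia nilotica", "country": "EG"},
    {"scientific_name": "Panthera leo", "country": "KE"},
]

def map_phase(record):
    """Emit one (key, 1) pair per record."""
    return (record["scientific_name"], 1)

def shuffle(pairs):
    """Group values by key, as the framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Combine all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values)

pairs = map(map_phase, records)
counts = {key: reduce_phase(values) for key, values in shuffle(pairs).items()}
print(counts)
```

Because each record is mapped independently and each key is reduced independently, the same logic can be spread across many cores or machines, which is what makes the pattern suitable for whole-dataset snapshots.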
iDigBio
• iDigBio: a 44-million-record dataset.
• Sparkonomy, an iDigBio tool, was developed to join tokenized taxon names from iDigBio to GBIF’s backbone taxonomy in a few minutes on a desktop computer.
• Effechecka from EOL is an early-phase web application that uses Spark jobs to construct checklists for taxon and spatial queries from iDigBio occurrence information.
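The idea of joining tokenized taxon names to a backbone taxonomy can be sketched as follows. This is not Sparkonomy's code. The tokenization rule (lowercase, first two tokens as genus and epithet), the backbone entries, and the identifiers are all assumptions made for illustration, and the join runs in plain Python rather than Spark.

```python
def tokenize(name):
    """Lowercase a name and split it into whitespace-separated tokens."""
    return name.lower().split()

def join_key(name):
    """Use the first two tokens (genus and epithet) as the join key.
    This rule is an assumption made for illustration."""
    return tuple(tokenize(name)[:2])

# Hypothetical backbone taxonomy: join key -> made-up taxon identifier.
backbone = {
    ("panthera", "leo"): "taxon-001",
    ("acacia", "nilotica"): "taxon-002",
}

# Hypothetical occurrence names as they might appear in raw records.
occurrence_names = [
    "Panthera leo (Linnaeus, 1758)",
    "ACACIA NILOTICA",
    "Unknownus species",
]

# Left join: keep every occurrence, with None where no backbone match exists.
joined = [(name, backbone.get(join_key(name))) for name in occurrence_names]
for name, taxon_id in joined:
    print(name, "->", taxon_id)
```

Keeping unmatched names in the result (with `None`) rather than dropping them makes it easy to inspect which records failed to match and why.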
Perform interactive analytics on observational scientific data
Grid or Many Task Software, Hadoop, Spark
Data Storage: HDFS, HBase, File Collection
Streaming data for weather
Science Analysis Code, Mahout, R
Transport batch of data to primary analysis data system
Record Scientific Data in “field”
Local Accumulate and initial computing
Direct Transfer
Examples include Remote Sensing, Astronomy and Bioinformatics