LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Alexandre Donizeti Alves, Horacio Hideki Yanasse, Nei Yoshihiro Soma

October 24, 2011, 11th Workshop on Domain-Specific Modeling


Page 1: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Alexandre Donizeti Alves

Horacio Hideki Yanasse

Nei Yoshihiro Soma

October 24, 2011

11th Workshop on Domain-Specific Modeling

Page 2: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction

The Lattes Platform is an information system deployed by CNPq (National Council for Scientific and Technological Development) to manage information on science, technology and innovation related to researchers and institutions in Brazil

This platform is undoubtedly the major source of information available on Brazilian researchers

Page 3: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction: Lattes Platform

http://lattes.cnpq.br

Page 4: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction

The Lattes CV system, a curricular information system, is the main component of the platform

Currently, the Lattes CV system stores around 2,000,000 curricula of researchers, lecturers, students and professionals from diverse areas of knowledge

Page 5: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction: Lattes CV system

http://buscatextual.cnpq.br/buscatextual

Jorge Almeida Guimaraes

Page 6: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction: Lattes curriculum (English)

Page 7: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction: Lattes curriculum (English)

Page 8: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction: Lattes curriculum (Portuguese)

Page 9: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction

In recent years, many studies have used data extracted from the Lattes Platform about researchers from different areas of knowledge

A common problem in these studies is that the curricula and the extracted information had to be obtained manually

Page 10: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Introduction

Therefore, this system has very high potential for quality information extraction

Page 11: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

LattesMiner

LattesMiner is an internal multilingual DSL for automatic information extraction from Lattes curricula

It is composed of a set of classes written in Java that allow developers to implement their own applications with a high level of abstraction and expressive power

Page 12: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

LattesMiner

Data Discovery is used to find the ID numbers of the researchers. Usually, only the name of the researcher is available.

Data Acquisition is responsible for downloading the Lattes curricula of the researchers from the Lattes CV system on the Web.

Data Extraction is the main component of LattesMiner. It is responsible for extracting data from the HTML files, using information extraction based on regular expressions.

The extracted data can be stored in XML files or in any database using the Data Structure component.

The Data Visualization component is responsible for the identification and visualization of academic social networks. These networks are identified by checking the relationships between researchers.

The Data Analysis component is responsible for analyzing the extracted data and the relationships identified.
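The regular-expression approach used by the Data Extraction component can be illustrated with a minimal sketch. It is not the LattesMiner implementation: the class name RegexExtractionSketch, the method extractYears and the HTML markup are hypothetical, chosen only to show the technique with the standard java.util.regex package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractionSketch {

    // Assumption: years appear as <span class="year">2010</span>.
    // Real Lattes pages use different markup.
    private static final Pattern YEAR =
        Pattern.compile("<span class=\"year\">\\s*(\\d{4})\\s*</span>");

    /** Returns every four-digit year found in the assumed markup, in document order. */
    public static List<Integer> extractYears(String html) {
        List<Integer> years = new ArrayList<Integer>();
        Matcher m = YEAR.matcher(html);
        while (m.find()) {
            // group(1) is the captured run of four digits
            years.add(Integer.parseInt(m.group(1)));
        }
        return years;
    }

    public static void main(String[] args) {
        String html = "<li><span class=\"year\">2009</span> Paper A</li>"
                    + "<li><span class=\"year\"> 2010 </span> Paper B</li>";
        System.out.println(extractYears(html)); // prints [2009, 2010]
    }
}
```

A pattern like this is tied to the page layout, which is one reason to keep it behind a higher-level interface.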

Page 13: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

LattesMiner

[Class diagram: package lattes.miner (LattesMiner); lattes.miner.en (Biodata, Board); lattes.miner.br (Perfil, Banca); lattes.miner.ie (BiodataIE, BoardIE); lattes.miner.dao (BiodataDao, BoardDao)]

The LattesMiner class is composed of instances of the classes Biodata and Board, in addition to many others not presented here.

Page 14: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

LattesMiner

LattesMiner was built as a fluent interface, which provides a compact yet easy-to-read representation of the domain problem

Fluent interfaces are implemented using method chaining

LattesMiner makes use of static factory methods and static imports
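As a rough illustration of these three ingredients, the sketch below builds a tiny fluent interface. The method names load, biodata, address and save echo the listings in this presentation, but the classes FluentSketch and Curriculum and everything inside them are hypothetical and only record the requested steps.

```java
import java.util.ArrayList;
import java.util.List;

public class FluentSketch {

    /** Hypothetical chain object; it only records which steps were requested. */
    public static final class Curriculum {
        private final String id;
        private final List<String> steps = new ArrayList<String>();

        private Curriculum(String id) { this.id = id; }

        // Each method returns this, which is what makes chaining possible.
        public Curriculum biodata() { steps.add("biodata"); return this; }
        public Curriculum address() { steps.add("address"); return this; }

        /** Terminal operation: ends the chain and reports what was requested. */
        public String save() { return id + ":" + String.join(",", steps); }
    }

    /** Static factory method: the entry point of a chain, used without "new". */
    public static Curriculum load(String id) { return new Curriculum(id); }

    public static void main(String[] args) {
        System.out.println(load("123").biodata().address().save()); // prints 123:biodata,address
    }
}
```

With a static import of FluentSketch.load, other classes could write load("123") without qualifying it, so a chain reads like a sentence of the domain.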

Page 15: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Case Study

http://plsql1.cnpq.br/divulg/RESULTADO_PQ_102003.curso

For the following examples, researchers in the Computer Science area holding a CNPq Research Productivity Scholarship were considered.

The list contains the names of all the researchers.

However, their corresponding ID numbers are not provided.

Page 16: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Listing 1

import java.util.*;
import lattes.util.Util;
import static lattes.miner.LattesMiner.*;

public class Listing1 {
    public static void main(String[] args) {
        List<String> list = new ArrayList<String>();
        // Look up each researcher by name and collect the results
        for (String name : Util.getList("names.txt"))
            list.add(search(name));
        // Write the collected results to ids.txt
        Util.setList(list, "ids.txt");
    }
}

Java application code

Page 17: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Listing 2

dir("cvs");
for (String id : Util.getList("ids.txt"))
    download(id).save();

Code fragment used to download the Lattes curricula of the researchers.

Page 18: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Listing 3

props("mysql");
for (String id : Util.getList("ids.txt")) {
    load(id).biodata().address().publications(JOURNAL).save();
}

This listing shows how to extract data from the Lattes curricula of the researchers.

Page 19: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Listing 4

for (String id : Util.getList("ids.txt")) {
    // Portuguese
    for (Banca b : carregar(id).bancas().getBancas()) {
        if (b.ano() == 2010)
            System.out.println(b.aluno());
    }
    // English
    for (Board b : load(id).boards().getBoards()) {
        if (b.year() == 2010)
            System.out.println(b.student());
    }
}

Code fragment illustrating how LattesMiner is used to extract information in different languages.

Page 20: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Results

SUCUPIRA is a system for the identification and visualization of academic social networks.

Here it shows the geographical distribution of the five researchers who have published the most articles in scientific journals.

Page 21: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Results

This is a graph of contacts of the five researchers who have published the most in scientific journals.

The graph depicts an academic social network of the five researchers.

Nodes are labeled with the name of the researcher

The color of the edges represents the number of relationships among researchers.

Page 22: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Conclusions

Currently, the Lattes curricula are available in HTML format

LattesMiner, however, does not depend on the data format, because it allows users to program their own applications at a high level of abstraction

If the data format is eventually modified, the DSL interface remains the same
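The idea of a stable interface over a changing data format can be sketched as follows. Everything here is hypothetical (the NameExtractor interface, both implementations, the two document layouts and the sample name); it is not LattesMiner code, only an illustration of client code that stays the same when the format-specific part is swapped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormatIndependenceSketch {

    /** What client code sees; nothing here mentions HTML or XML. */
    public interface NameExtractor {
        String name(String document);
    }

    /** Assumed HTML layout: <h1>Name</h1>. */
    public static final class HtmlNameExtractor implements NameExtractor {
        private static final Pattern P = Pattern.compile("<h1>(.*?)</h1>");
        public String name(String document) { return first(P, document); }
    }

    /** Assumed XML layout: <cv name="Name"/>. */
    public static final class XmlNameExtractor implements NameExtractor {
        private static final Pattern P = Pattern.compile("<cv name=\"(.*?)\"");
        public String name(String document) { return first(P, document); }
    }

    /** Returns the first captured group, or null when the pattern does not match. */
    private static String first(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Client code: unchanged whichever extractor is plugged in. */
    public static String describe(NameExtractor extractor, String document) {
        return "Researcher: " + extractor.name(document);
    }

    public static void main(String[] args) {
        // Both lines print: Researcher: Ana Silva
        System.out.println(describe(new HtmlNameExtractor(), "<h1>Ana Silva</h1>"));
        System.out.println(describe(new XmlNameExtractor(), "<cv name=\"Ana Silva\"/>"));
    }
}
```

Only the extractor classes know about the layout, so a format change is absorbed there and callers of describe are untouched.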

Page 23: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Conclusions

An advantage of LattesMiner is that it searches by the name of the researcher

LattesMiner is multilingual

Another advantage is that the extracted data can be stored in a structured format (XML or database), allowing these data to be easily used by other applications

Page 24: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Future work

The next step, which is already being implemented in the LattesMiner DSL, is a statistical analysis of the data

Page 25: LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

ACKNOWLEDGMENTS