How to Make Google Books at Home

Post on 18-Nov-2014

How to make Google Books

at home

in Perl

at home

not to beat Google

What do we have on the internet today?

Find a word

Find the words

Google

Show the page

Show the page and highlight the words

... Russia ... ece yapc/coe ...

Pushkin?

YAPC?

XIX?

WTF?

ece yapc/coe

ece yapc/coe

все царское (Russian for "everything royal")

Amazon

Guess the next screen

Text archive

Berlin

How to make it

PDF

PDF (black box)

PDF WEB

Sample PDF: use.perl.org/~andy.sh/journal

Work with PDF? No

SVG: Scalable Vector Graphics

http://www.w3.org/Graphics/SVG/

SVG is XML: XML::LibXML, XPath, XSLT

PDF to SVG: http://www.pdftron.com/pdf2svg/

$ ./pdf2svg book.pdf book.svg

Structure

Geometry

<g>
    <g>
        <text>
            <tspan>
            </tspan>
        </text>
        <text>
        </text>
    </g>
    <g>
    </g>
</g>
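The nesting above can be walked with XML::LibXML and XPath, the tools named on the earlier slides. The following is only a sketch: the inline SVG is a hypothetical stand-in for one page of pdf2svg output, and it assumes the output declares the standard SVG namespace, which is why a prefix is registered before querying.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical stand-in for one page of converter output.
my $svg = <<'SVG';
<svg xmlns="http://www.w3.org/2000/svg">
  <g>
    <g>
      <text transform="matrix(1 0 0 -1 10 584)">
        <tspan x="0,16.875" y="-0">Wh</tspan>
      </text>
      <text transform="matrix(1 0 0 -1 10 560)">
        <tspan x="0" y="-0">Y</tspan>
        <tspan x="12" y="-0">APC</tspan>
      </text>
    </g>
    <g>
    </g>
  </g>
</svg>
SVG

my $doc = XML::LibXML->load_xml(string => $svg);

# SVG elements live in a namespace, so XPath needs a registered prefix.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(svg => 'http://www.w3.org/2000/svg');

# Collect every <text> with its transform and the text of its <tspan>s.
my @texts;
for my $text ($xpc->findnodes('//svg:g/svg:text')) {
    my @spans = map { $_->textContent } $xpc->findnodes('svg:tspan', $text);
    push @texts, {
        transform => $text->getAttribute('transform'),
        spans     => \@spans,
    };
}

for my $t (@texts) {
    print "$t->{transform}: ", join('|', @{ $t->{spans} }), "\n";
}
```

Keeping the transform next to the tspan texts means the position of each fragment can be worked out later without going back to the DOM.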

<g>

<text>

<text transform=...>

<text transform="matrix(1 0 0 -1 10 584)">

Page > g > text + transform

<tspan>

Page > g > text + transform > tspan

    my $transform = $node->findvalue('@transform');
    if ($transform =~ /matrix/) {
        # matrix(a b c d e f): capture a, d (scale) and e, f (translation)
        my $num = qr/-?\d+(?:\.\d+)?/;
        my ($sx, $sy, $tx, $ty) =
            $transform =~ /matrix\(\s*($num)\s+$num\s+$num\s+($num)\s+($num)\s+($num)\s*\)/;

        print "($sx, $sy, $tx, $ty)";
        $pos{x} = $sx * $tx;
        $pos{x} += $pos{pagew} if $sx < 0;
        $pos{y} = $sy * $ty;
        $pos{y} += $pos{pageh} if $sy < 0;
        print " [$pos{x}, $pos{y}]";
    }
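For comparison, here is a sketch of the general rule the SVG specification gives for `matrix(a b c d e f)`: a local point (x, y) maps to (a*x + c*y + e, b*x + d*y + f). This is not the code from the talk; the helper names `parse_matrix` and `apply_matrix` are made up for illustration, and only the sample matrix comes from the slides.

```perl
use strict;
use warnings;

# Parse "matrix(a b c d e f)" into its six numbers; returns an empty list
# if the attribute does not hold a six-number matrix.
sub parse_matrix {
    my ($transform) = @_;
    my $num = qr/[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?/;
    my ($body) = $transform =~ /matrix\(([^)]*)\)/ or return;
    my @m = $body =~ /($num)/g;
    return @m == 6 ? @m : ();
}

# Map a point from the element's local coordinates to page coordinates:
#   x' = a*x + c*y + e,   y' = b*x + d*y + f
sub apply_matrix {
    my ($m, $x, $y) = @_;
    my ($ma, $mb, $mc, $md, $me, $mf) = @$m;
    return ($ma * $x + $mc * $y + $me, $mb * $x + $md * $y + $mf);
}

my @m = parse_matrix('matrix(1 0 0 -1 10 584)');

# Second glyph of the sample <tspan>: local offset x = 16.875, y = 0.
my ($px, $py) = apply_matrix(\@m, 16.875, 0);
print "page position: ($px, $py)\n";
```

With d = -1 the vertical axis is flipped, which matches PDF placing its origin at the bottom of the page while SVG places it at the top.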

<tspan x="0,16.875,26.258,34.695,40.314,44.533,49.224,55.789,60.008,64.699" y="-0" class="ps00 ps23">What is it</tspan>
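The `x` attribute carries one offset per character, so each word's horizontal extent can be recovered by pairing characters with offsets. A minimal sketch, assuming the tspan text is "What is it" on a single line; the function name `word_spans` is hypothetical.

```perl
use strict;
use warnings;

# Pair each character of a <tspan> with its x offset and return one
# [word, start_x, last_glyph_x] triple per whitespace-separated word.
sub word_spans {
    my ($text, $x_attr) = @_;
    my @xs    = split /[,\s]+/, $x_attr;
    my @chars = split //, $text;
    my (@words, $current);
    for my $i (0 .. $#chars) {
        if ($chars[$i] =~ /\s/) {
            undef $current;          # a space ends the current word
            next;
        }
        last unless defined $xs[$i]; # fewer offsets than characters
        if (!$current) {
            $current = [ '', $xs[$i], $xs[$i] ];
            push @words, $current;
        }
        $current->[0] .= $chars[$i];
        $current->[2]  = $xs[$i];
    }
    return @words;
}

my @words = word_spans(
    'What is it',
    '0,16.875,26.258,34.695,40.314,44.533,49.224,55.789,60.008,64.699',
);
printf "%-5s starts at %s, last glyph at %s\n", @$_ for @words;
```

The start offset and the offset of the last glyph give a rough box for drawing a highlight. The width of the final glyph is not in the attribute and would have to come from elsewhere.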

YAPC

<tspan>YAPC</tspan>

Y APC

<tspan>Y</tspan>

<tspan>APC</tspan>
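A word split across several tspans has to be glued back together before it can be indexed. The slides do not show how this was done; one possible approach, sketched here under stated assumptions, is to join neighbouring fragments whose horizontal gap is small. The fragment positions, the widths, and the threshold are all hypothetical.

```perl
use strict;
use warnings;

# Join consecutive fragments into one word when the gap between the end of
# one fragment and the start of the next is at most $max_gap (same units as
# the SVG coordinates). Each fragment is { text, x, width }.
sub merge_fragments {
    my ($max_gap, @fragments) = @_;
    my @merged;
    for my $f (sort { $a->{x} <=> $b->{x} } @fragments) {
        my $prev = $merged[-1];
        if ($prev && $f->{x} - ($prev->{x} + $prev->{width}) <= $max_gap) {
            $prev->{text} .= $f->{text};
            $prev->{width} = $f->{x} + $f->{width} - $prev->{x};
        }
        else {
            push @merged, { %$f };   # copy so the input is left untouched
        }
    }
    return @merged;
}

# Hypothetical positions for the two pieces of "YAPC" and a following word.
my @words = merge_fragments(
    2,
    { text => 'Y',        x => 0,  width => 9  },
    { text => 'APC',      x => 10, width => 30 },
    { text => 'attendee', x => 48, width => 60 },
);
print join(' ', map { $_->{text} } @words), "\n";
```

The threshold would need tuning per font size, since a gap that is kerning in a heading can be a word space in body text.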

Dictionary

mysql> select * from base where base like 'seek';
+--------+------+-------+---------+
| id     | base | rules | grammar |
+--------+------+-------+---------+
| 189785 | seek | GRSZ  |         |
+--------+------+-------+---------+

mysql> select * from word where ref = 189785;
+--------+---------+
| ref    | word    |
+--------+---------+
| 189785 | seek    |
| 189785 | seeking |
| 189785 | seeker  |
| 189785 | seeks   |
| 189785 | seekers |
+--------+---------+
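The two tables map every word form to a shared base form. The same lookup can be sketched with an in-memory hash instead of MySQL; this is a swap made for brevity, not the storage used in the talk, and `base_form` is a hypothetical helper. The word forms are the ones listed in the `word` table.

```perl
use strict;
use warnings;

# In-memory stand-in for the `word` table: word form => base form.
my %base_of = map { $_ => 'seek' } qw(seek seeking seeker seeks seekers);

# Return the base form of a word, or the lowercased word if it is unknown.
sub base_form {
    my ($word) = @_;
    my $lc = lc $word;
    return $base_of{$lc} // $lc;
}

print base_form($_), "\n" for qw(Seeks seekers YAPC);
```

Indexing and querying through the same function is what lets a search for "seek" also find a page that only contains "seeks".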

YAPC attendee seeks where to drink after the evening talk.

Morphology

YAPC attendee seeks where to drink after the evening talk.

Stop words

YAPC attendee seeks where to drink after the evening talk.

yapc attendee seeks where to drink after the evening talk.
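The last two steps, lowercasing and dropping stop words, can be sketched on the sample sentence. The stop-word list below is an assumption made for this example, and `index_terms` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; a real index would use a much longer one.
my %stop = map { $_ => 1 } qw(a an the to where after);

# Lowercase the text, split it into words, and drop the stop words.
sub index_terms {
    my ($text) = @_;
    return grep { !$stop{$_} } map { lc $_ } $text =~ /(\w+)/g;
}

my @terms = index_terms(
    'YAPC attendee seeks where to drink after the evening talk.'
);
print "@terms\n";
```

What remains is the list of terms worth storing in the index. Each could then be reduced to its base form through the dictionary.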

DEMO

live demonstration at

http://booksearch.andy.sh

__END__

Andrew Shitov

http://andy.sh
