Multimedia search engine
Michal Krsek, UISK Charles University at Prague & CESNET
Ivan Doležal, CESNET
Michal Illich, Jyxo
Electronic Media
• TV & radio
• Organized in channels
• Zero democracy in programming (by channel management)
• Centralized production (big guys business)
Internet
• Not only web (audio/video and others)
– Remember archie.sura.net?
• IPTV / Live / Video on demand
• Navigation only via web
=> not easy to find specific program in A/V
Search options I
• Voice recognition
– Language identification
– Accents
• Video recognition
– Text interpretation (bush vs. Bush)
– Low video quality
Search options II
• Indexing of web pages
– Yahoo! does this (google bomb target)
• Metadata
– “Out-of-band metadata” (as in the librarian world)
– Metadata in files (added during editing or encoding)
Project description
• Started in 2003 (oh yes, one year before Truveo)
• “Google for audio and video on Internet”
• No support from content owners
• Modular concept
• Start with .cz Internet
Technical description I
• Crawler
– Crawls the web and collects addresses (URLs)
– Exports URLs of multimedia files
– Software written by Jyxo (Linux console app)
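The export step of the crawler can be illustrated with a minimal sketch. It assumes that multimedia addresses are recognized by file extension, which the slides do not state. The extension list and the function names `is_multimedia_url` and `export_multimedia_urls` are hypothetical and do not come from the Jyxo software.

```python
import posixpath
from urllib.parse import urlparse

# Assumed extension list, for illustration only; the slides do not name formats.
MEDIA_EXTENSIONS = {
    ".avi", ".mpg", ".mpeg", ".wmv", ".asf", ".mov", ".rm", ".mp3", ".wav", ".ogg",
}


def is_multimedia_url(url):
    """Return True if the URL path ends in one of the assumed media extensions."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in MEDIA_EXTENSIONS


def export_multimedia_urls(urls):
    """Keep only multimedia URLs, preserving order and dropping duplicates."""
    seen = set()
    exported = []
    for url in urls:
        if is_multimedia_url(url) and url not in seen:
            seen.add(url)
            exported.append(url)
    return exported
```

A real crawler would probably also look at HTTP content types and streaming protocols; extension matching is only the simplest possible filter.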
Technical description II
• Distiller
– Imports addresses of multimedia files
– Distills metadata (and makes XML files)
– Makes screenshots (if video in file)
– C# software and mplayer (Windows apps)
– Runs in distributed environment
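The slides do not show the XML format the distiller produces. As a rough sketch of what "distills metadata and makes XML files" could look like, the following builds one record per URL. The element names (`media`, `url`, `metadata`, `field`, `screenshot`) and the function `metadata_to_xml` are assumptions made for illustration, and the sketch is in Python rather than the C# the project used.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(url, metadata, screenshots=()):
    """Serialize one distilled record to an XML string (hypothetical schema)."""
    root = ET.Element("media")
    ET.SubElement(root, "url").text = url

    meta = ET.SubElement(root, "metadata")
    for key in sorted(metadata):  # sorted for stable output
        field = ET.SubElement(meta, "field", name=key)
        field.text = str(metadata[key])

    # Screenshots exist only when the file contains video.
    for path in screenshots:
        ET.SubElement(root, "screenshot").text = path

    return ET.tostring(root, encoding="unicode")
```

Keeping each record in its own small XML document would let the database import step run independently of the distillation machines.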
Technical description III
• Database
– Imports XML metadata files into the full-text DB
– Answers back-end queries coming from the web interface
– And other full-text things (e.g. language)
[Diagram: processing pipeline]
• Crawling: crawls web pages, gets addresses, filters A/V addresses
• Distillation: gets metadata from multimedia files
• Indexing / search: holds full-text database, provides back end for queries
Distillation
• Process description
– Get URL from DB
– Get metadata from file available at URL
– Get screenshots at 1, 30, 50 sec
– Save metadata & screenshot
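The screenshot step can be sketched as building one mplayer command per offset. The options used here (`-ss`, `-frames`, `-nosound`, `-vo jpeg:outdir=...`) are standard mplayer options, but the exact invocation used by the project is not given in the slides, so this is an assumption. The function name `screenshot_commands` and the output directory layout are hypothetical.

```python
SCREENSHOT_OFFSETS = (1, 30, 50)  # seconds, as listed in the process description


def screenshot_commands(url, outdir):
    """Build one mplayer command line per screenshot offset.

    The commands are returned as argument lists and not executed here;
    they could be passed to subprocess.run() on a machine with mplayer.
    """
    commands = []
    for offset in SCREENSHOT_OFFSETS:
        commands.append([
            "mplayer",
            "-nosound",              # audio is not needed for a still frame
            "-ss", str(offset),      # seek to the offset in seconds
            "-frames", "1",          # decode a single frame, then exit
            "-vo", "jpeg:outdir={}/{}s".format(outdir, offset),
            url,
        ])
    return commands
```

Seeking is not possible in every stream, so on some sources the player may have to play through in real time, which increases the time spent per URL.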
Distillation
• Use of win32 applications
– Native players (WMP, RP, Qt) for metadata
– Mplayer for screenshots
• Takes one minute on average
– Slow servers/bandwidth
– Streaming without fast forward
DistillerGRID
• At one minute per URL, 16 years needed to distill 8.500.000 URLs
• Ideal application for GRID computing
– No need for real-time response
– Huge amount of computing time needed
• Two ways to create GRID
– Build dedicated system
– Use of current capacities
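The 16-year figure follows from the one-minute average per URL and can be checked with a short calculation. The worker count of 100 in the second call is a hypothetical value, and the model assumes the workers run continuously, which is an idealization.

```python
URLS = 8_500_000
MINUTES_PER_URL = 1                 # average distillation time from the slides
MINUTES_PER_YEAR = 60 * 24 * 365


def years_needed(urls, minutes_per_url, workers=1):
    """Wall-clock years to distill all URLs with perfectly parallel workers."""
    return urls * minutes_per_url / (MINUTES_PER_YEAR * workers)


single = years_needed(URLS, MINUTES_PER_URL)
print("one worker: about {:.1f} years".format(single))

# Hypothetical: 100 workers running around the clock.
parallel_days = years_needed(URLS, MINUTES_PER_URL, workers=100) * 365
print("100 workers: about {:.0f} days".format(parallel_days))
```

Since each URL is processed independently and nothing waits on the result, the work divides almost perfectly across machines.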
Computing machines
• PC/Windows based
• HW independent
• Secure environment
– Security of hosting system
– Security of distillation process
• Well connected
• No need to run 24x7
• Easy to manage
Configuration
• ~100 PCs in student labs
• Running on demand during weekends
• Virtual machines (MS VPC 2004) in hosting system (Win XP)
• Three different HW configurations
• Peak rate about 5000 URLs per minute
• SQL as background -> pull distribution of work
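"Pull distribution of work" suggests that each distillation station asks a central SQL database for its next URL, so machines can join or leave at any time. A minimal sketch of that pattern follows. It uses SQLite as a stand-in because the slides do not name the database product, and the table layout and the functions `create_queue`, `claim_job` and `finish_job` are hypothetical.

```python
import sqlite3


def create_queue(conn, urls):
    """Create the job table (if missing) and enqueue URLs as pending jobs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT UNIQUE,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def claim_job(conn, worker):
    """Pull the oldest pending job for this worker; return (id, url) or None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, url FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE guards against another worker claiming first.
        updated = conn.execute(
            "UPDATE jobs SET status = 'claimed', worker = ?"
            " WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        if updated.rowcount != 1:
            return None
        return row


def finish_job(conn, job_id):
    """Mark a claimed job as done."""
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
```

With a pull model the server does not need to know which lab PCs are switched on, which fits machines that run only on demand during weekends.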
Current status I
• HW
– 20 crawlers
– 2 servers for full-text DB (<1.400 USD)
– Distillation stations (X office PC)
– Connected by 1 Gb/s to CESNET2 -> GEANT2
Current status II
• Database
– EU + .com, .edu
– > 13.000.000 URLs
– > 8.000.000 valid
– > 2.800.000 with screenshots
Live show?
Want to test?
• URLs
– http://multimedia.jyxo.cz
– http://videoserver.cesnet.cz/videoarchiv_en.php
– For XML interface send me e-mail
Questions? Comments?
Michal Krsek, [email protected] (academic service, cooperation)
Michal Illich, [email protected] (business service)