[rakutentechconf2013] [c-4_2] building structured data from product descriptions
DESCRIPTION
Rakuten Technology Conference 2013 "Building Structured Data from Product Descriptions" Keiji Shinzato (Rakuten)
TRANSCRIPT
Building Structured Data from Product Descriptions
Keiji Shinzato
2
An Italian product. This is a fruity red wine that mainly consists of sangiovese grapes of Tuscany.
Product information extraction
Type Red
Grape variety Sangiovese
Region Italy, Tuscany
3
Background
Unstructured data
ベリンダ・コーリー キアンティ 2011 750ml
One of Italy's representative red wines, made mainly from Sangiovese grapes of the Chianti district in Tuscany.
Structured data
Attribute Value
Type Red
Region Italy, Tuscany (Chianti district)
Grape Sangiovese
Vintage 2011
• Structured data play a crucial role in making Rakuten a more attractive service.
  – Faceted navigation, recommendation, and market analysis.
4
Faceted navigation
Reference: http://www.amazon.com/
5
Background
• An unsupervised methodology is required.
  – 100 million products / 40,000 categories.
6
A table is a useful clue, but…
Montes Alpha M 2009
Product page including a table
WINE > CHILE
Type Red
Region Chile
Grape Cabernet Sauvignon, Merlot, Cabernet Franc, Petit Verdot
Year 2009
Montes Alpha M 2009
Product page consisting of sentences
Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a …
WINE > CHILE
38%
7
Product information extraction
• Issue 1: How do we know the attributes for a category?
• Issue 2: How do we extract attribute values from full texts?
Product page (unstructured)
Montes Alpha M 2009
WINE > CHILE
Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a very well defined character. …
Structured data
Attribute Value
Type Red
Region Chile
Grape Cabernet Sauvignon, Merlot, Cabernet Franc, Petit Verdot
Vintage 2009
Company Montes
8
Reference: http://item.rakuten.co.jp/redbox/odm3000728/
Attribute name collection
Analyze a large amount of table data to collect the attributes of an object.
Attribute names of wine
Attribute values
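The attribute name collection step can be pictured with a small sketch. It assumes each product-page table has already been parsed into a dictionary, and it keeps names that occur in a minimum share of tables. The `min_share` threshold, the function name, and the sample tables are illustrative assumptions, not details given in the talk.

```python
from collections import Counter


def collect_attribute_names(tables, min_share=0.5):
    """Return attribute names that appear in at least `min_share` of the tables."""
    counts = Counter()
    for table in tables:
        # Count each name once per table (iterating a dict yields its keys).
        counts.update(set(table))
    threshold = min_share * len(tables)
    return sorted(name for name, n in counts.items() if n >= threshold)


# Hypothetical tables parsed from wine product pages (attribute -> value).
wine_tables = [
    {"Type": "Red", "Region": "Chile", "Grape": "Merlot", "Year": "2009"},
    {"Type": "White", "Region": "France", "Grape": "Chardonnay"},
    {"Type": "Red", "Region": "Italy", "Shipping": "Free"},
]
attribute_names = collect_attribute_names(wine_tables)
print(attribute_names)
```

A frequency cut-off like this drops names that only one shop uses (here "Year" and "Shipping") while keeping those shared across many pages.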
9
Attribute value database (wine)
Grape variety | Volume | Region | Winery | Taste
Chardonnay | 750ML | France | Farnese | Dry
Chardonnay100% | 720ML | Italy | Mas de Monistrol | Full body
Merlot | 375ML | Spain | Leroy | Medium body
Riesling | 500ML | Chile | M. Chapoutier | Slightly sweet
Syrah | 1500ML | German | Mastroberardino | Sweet
Grenache | 360ML | Australia | Santero | Medium dry
Merlot | 200ML | America | Saltarelli | Extremely sweet
Tempranillo | 3000ML | Bordeaux | Cavicchioli | Medium dry
Sangiovese | 1800ML | Champagne | Fontodi | Red Full body
Syrah100% | 1000ML | Argentina | Ca'Rugate | Middle sweet
Precision is high, but coverage is low.
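A database of this shape could be assembled roughly as follows. This is a minimal sketch that assumes tables are already parsed into dictionaries and that a comma separates multiple values in one cell. Both the comma convention and the name `build_value_database` are assumptions made for illustration.

```python
from collections import defaultdict


def build_value_database(tables):
    """Map each attribute name to the set of values seen for it in tables."""
    database = defaultdict(set)
    for table in tables:
        for attribute, cell in table.items():
            # A cell such as "Cabernet sauvignon,Merlot" holds several values.
            for value in cell.split(","):
                value = value.strip()
                if value:
                    database[attribute].add(value)
    return dict(database)


# Hypothetical parsed tables from two product pages.
tables = [
    {"Grape variety": "Cabernet sauvignon,Merlot", "Region": "Chile"},
    {"Grape variety": "Chardonnay", "Region": "France", "Taste": "Dry"},
]
value_db = build_value_database(tables)
print(value_db)
```

Because the values come only from pages that already have tables, the entries tend to be correct, but many products are never covered.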
10
Product information extraction
• Issue 1: How do we know the attributes for each category?
• Issue 2: How do we extract attribute values from product descriptions?
Product page (unstructured)
Montes Alpha M 2009
WINE > CHILE
Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a very well defined character. …
Structured data
Attribute Value
Type Red
Region Chile
Grape Cabernet Sauvignon, Merlot, Cabernet Franc, Petit Verdot
Vintage 2009
Company Montes
11
Unsupervised attribute value extraction (distant supervision approach)
• Construction: build a database of attribute values from semi-structured data, e.g. <Region, Margaux>, <Color, White>.
• Annotation: annotate product pages that include entries in the database, e.g. Chateau d’Issan 1994: "This is a wine from Margaux. ..."
• Generation: generate extraction rules from the annotated pages, e.g. "wine from x ⇒ x is a Region". The rule is generated through a machine learning algorithm.
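The annotation step of distant supervision can be sketched as plain string matching against the database. The sketch below assumes exact, case-sensitive matching and a simple `<Attribute>value</Attribute>` tag format. Both choices, and the function name `annotate`, are illustrative assumptions rather than the method described in the talk.

```python
import re


def annotate(text, value_db):
    """Wrap every database value found in `text` with its attribute tag."""
    value_to_attr = {}
    for attr, values in value_db.items():
        for value in values:
            # If a value is listed under two attributes, keep the first one.
            value_to_attr.setdefault(value, attr)
    if not value_to_attr:
        return text
    # Try longer values first so a long value is not split by a shorter one.
    ordered = sorted(value_to_attr, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))

    def tag(match):
        value = match.group(0)
        attr = value_to_attr[value]
        return f"<{attr}>{value}</{attr}>"

    return pattern.sub(tag, text)


value_db = {"Region": {"Margaux"}, "Color": {"White"}}
annotated = annotate("This is a wine from Margaux.", value_db)
print(annotated)
```

Matching of this kind needs no manual labels, but it also tags unrelated mentions of a value, so the resulting training corpus contains some noise.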
12
Corpus with attribute-value annotations (wine)
• J: <産地>アルザス</産地>で最も香り豊かと言われるスパイシーで華やかなワイン。
  E: A spicy and gorgeous wine that is said to have the richest aroma in <production_area> Alsace </production_area>.
• J: 最もお手頃で、<生産者>ドメーヌ・ペゴー</生産者>の美味しさを気軽に楽しめる、とっても嬉しい一本なのです
  E: This is a very nice wine because we can easily enjoy the taste of <winery> Domaine Pegau </winery> at the best price.
• J: <ぶどう品種>ソーヴィニヨン・ブラン</ぶどう品種>種の特長がよく表れたワイン。
  E: A wine in which the characteristics of <grape_variety> Sauvignon Blanc </grape_variety> are well expressed.
• J: <タイプ>白</タイプ>身魚の塩焼きやシンプルな味付けのソテー、焼き牡蠣、豚のしょうが焼き、ボンゴレビアンコなどと。
  E: Grilled or sautéed <type> white </type> fish, grilled oyster, pork sauté with ginger, vongole bianco, and others.
14
Extraction rule generation
• Algorithm: Conditional random fields [Lafferty+ 2001]
• Chunk tag: Start/End (IOBES) model [Sekine+ 1998]
• Features:
  – Token: Surface form of the token.
  – Base: Base form of the token.
  – PoS: Part-of-Speech tag of the token.
  – Char. type: Types of characters in the token.
  – Prefix: Double character prefix of the token.
  – Suffix: Double character suffix of the token.
  – The above features of ±3 tokens surrounding the token.
They are frequently employed in the task of Japanese named entity recognition.
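The chunk tags and features listed above can be sketched in plain Python. The sketch covers IOBES tagging and the surface, character-type, prefix, suffix, and ±3 window features. Base form and part-of-speech are left out because they would come from a morphological analyzer. The coarse `char_type` classes and all function names are assumptions for illustration, and no CRF training is shown.

```python
def iobes_tags(n_tokens, spans):
    """Build IOBES tags from (start, end_exclusive, label) token spans."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"  # single-token value
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


def char_type(token):
    """Very coarse character type (an assumption; Japanese needs finer classes)."""
    if token.isdigit():
        return "digit"
    if token.isalpha():
        return "alpha"
    return "other"


def token_features(tokens, i, window=3):
    """Features of token i and of the tokens within +/- `window` positions."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            tok = tokens[j]
            feats[f"token[{offset}]"] = tok
            feats[f"chartype[{offset}]"] = char_type(tok)
            feats[f"prefix[{offset}]"] = tok[:2]   # double-character prefix
            feats[f"suffix[{offset}]"] = tok[-2:]  # double-character suffix
    return feats


tokens = ["This", "is", "a", "wine", "from", "Margaux", "."]
tags = iobes_tags(len(tokens), [(5, 6, "Region")])
features = token_features(tokens, 5)
print(tags)
print(features["token[-1]"], features["prefix[0]"], features["suffix[0]"])
```

Per-token feature dictionaries paired with a tag sequence of this kind are the usual input shape for a linear-chain CRF trainer.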
16
Unsupervised attribute value extraction (distant supervision approach)
Terre di matraja Bianco 2012
This is a wine from Tuscany....
Apply
Rule: wine from x ⇒ x is a Region
Rule: 1800 < x <= 2013 ⇒ x is a Vintage
Attribute Value
Region Tuscany
Vintage 2012
Grape Chardonnay
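To make the application step concrete, the two rules shown on this slide can be written out by hand. This is only a stand-in: in the talk the rules are learned by a machine learning model, whereas the regular expressions and function names below are assumptions for illustration.

```python
import re


def extract_region(text):
    """Rule: "wine from x" => x is a Region (x = following capitalized words)."""
    match = re.search(r"wine from ((?:[A-Z][\w'-]*)(?: [A-Z][\w'-]*)*)", text)
    return match.group(1) if match else None


def extract_vintage(text):
    """Rule: 1800 < x <= 2013 => x is a Vintage."""
    for number in re.findall(r"\b\d{4}\b", text):
        if 1800 < int(number) <= 2013:
            return number
    return None


title = "Terre di matraja Bianco 2012"
description = "This is a wine from Tuscany...."
record = {
    "Region": extract_region(description),
    "Vintage": extract_vintage(title),
}
print(record)
```

The Grape value in the slide's table is not covered here, since the slide shows no rule that produces it.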
17
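Performance (F-score)
[Figure: bar chart of F-scores for the Wine and Shampoo categories, comparing "With ML" and "Without ML". Values shown on the chart: 60.1 pt., 71.5 pt., 43.8 pt., 24.1 pt.]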
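For reference, an F-score over extracted attribute-value pairs can be computed as the harmonic mean of precision and recall. The talk does not say how its scores were aggregated, so the pair-level exact-match scoring below and the sample data are assumptions.

```python
def f_score(predicted, gold):
    """Harmonic mean of precision and recall over exact-match pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted (attribute, value) pairs for one product.
gold = {("Region", "Tuscany"), ("Vintage", "2012"), ("Grape", "Chardonnay")}
predicted = {("Region", "Tuscany"), ("Vintage", "2012"), ("Type", "Red")}
score = f_score(predicted, gold)
print(round(score, 3))
```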
18
Wine / Japanese
An Italian product. This is a fruity red wine that mainly consists of sangiovese grapes of Tuscany.
Type Red
Grape variety Sangiovese
Region Italy, Tuscany
19
Shampoo / Japanese
"MCH Natural shampoo 1000ml" is a shampoo consisting of cypress oil and charcoal.
Category Shampoo
Product name MCH Natural shampoo 1000ml
Ingredient Cypress oil, Charcoal
20
Video game / French
Product type Nintendo 64, Nintendo DS
Saga Mario
21
Conclusion
• Developing a technique for extracting product information from unstructured data.
  – Independent of any category and language.
• Useful services can be realized on structured product data.
• Our paper is available on the web.
  – ACL anthology: http://aclweb.org/anthology//I/I13/
22
Thank you for listening!