Dialog Engine for Product Information

Vamsee Chamakura - 201301243
Satyam Verma - 201505604
Jitta Divya Sai - 201225167

Problem Statement

The aim of the project is to implement a dialog system that answers users' queries about product information. The implementation has been divided into the following modules:

● Crawling and scraping the product information from Flipkart.
● Processing the scraped data.
● Saving the data in MongoDB (a NoSQL database).
● Preprocessing the query.
● Querying the database and extracting the relevant results.

System Design

[Figure: system design diagram. Components shown: Flipkart, Crawler & Scraper, Database, MongoDB Driver, Query Processor, User's Query, Result.]

Crawling and Scraping

● Scrapy and BeautifulSoup were used to crawl the data from Flipkart's website.
● The categories scraped were mobiles, televisions, laptops, air conditioners, refrigerators and cameras.
● Around 3000 products were extracted from these categories.

Processing the Scraped Data

● BeautifulSoup is a Python library for extracting data from HTML or XML pages.
● We used BeautifulSoup to extract all the properties from the crawled web pages.
● Different products had varying numbers and types of properties, so we used MongoDB for its flexibility.

    import requests
    from bs4 import BeautifulSoup

    r = requests.get(url)  # url: address of one crawled product page
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, "lxml")
        # Property names and their values sit in separate table cells
        keys = soup.find_all("td", {"class": "specsKey"})
        vals = soup.find_all("td", {"class": "specsValue"})
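The two cell lists line up position by position, so they can be folded into one dictionary per product. The following is a minimal sketch, assuming the cell texts have already been pulled out as plain strings (for example with `get_text()`); the helper name `build_document` and the sample values are hypothetical and not taken from the project.

```python
def build_document(keys, vals):
    """Pair property names with their values (hypothetical helper)."""
    doc = {}
    for key, val in zip(keys, vals):
        key = " ".join(key.split()).lower()  # collapse whitespace, lowercase
        val = " ".join(val.split())          # collapse whitespace only
        if key:                              # skip cells with an empty name
            doc[key] = val
    return doc


# Illustrative values only, not real scraped data.
sample = build_document(["Model Name ", "Display  Size"], ["Example X1", " 5 inch "])
print(sample)
```

Normalising the property names at this stage makes later lookups less sensitive to spacing and capitalisation on the source page.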

Storing the Data in MongoDB

● MongoDB is a document-oriented NoSQL database, often used for storing large data sets.
● The database has six collections, namely:

Mobiles

Laptops

Television

Air Conditioner

Camera

Refrigerator

Contd…

● MongoDB stores each record as a JSON-like object called a document, which allows a lot of flexibility in storage.

● The properties of each product are stored as key-value pairs, with the model name as the primary key.
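Keying each document by model name means that re-scraping a product should overwrite its old record rather than add a duplicate. The sketch below shows one way to do that, assuming the collection object behaves like a PyMongo collection (only `replace_one` with `upsert=True` is used). The field name `"model name"`, the helper `store_product` and the in-memory stand-in class are assumptions made so the example runs without a database server.

```python
def store_product(collection, doc, key_field="model name"):
    """Insert or overwrite one product document, keyed by its model name."""
    if key_field not in doc:
        raise ValueError("document has no %r field" % key_field)
    return collection.replace_one({key_field: doc[key_field]}, doc, upsert=True)


class InMemoryCollection:
    """Tiny stand-in for a collection (hypothetical, for illustration only)."""

    def __init__(self):
        self.docs = {}

    def replace_one(self, query, replacement, upsert=False):
        (_, value), = query.items()  # the sketch supports one-field filters
        if value in self.docs or upsert:
            self.docs[value] = dict(replacement)


mobiles = InMemoryCollection()
store_product(mobiles, {"model name": "Example X1", "price": "9999"})
store_product(mobiles, {"model name": "Example X1", "price": "8999"})
print(mobiles.docs)
```

Because the second call reuses the same model name, the collection still holds a single document, now with the updated price.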

Preprocessing the Query

We handled three types of queries:

● Template based queries.
● Natural language (NL) based queries.
● Comparison based (CB) queries.

Examples:

● Template for queries: QUERY SYNTAX ERROR, REQUIRED: [PRODUCT NAME], [PROPERTY]
● NL queries: What is the price of Apple iPhone 5s?
● CB queries: Which among apple iphone, samsung galaxy has the best price?
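A template query can be split directly on the comma. The sketch below assumes the template is exactly `[PRODUCT NAME], [PROPERTY]`, as the error message above suggests; the function name `parse_template_query` is hypothetical.

```python
TEMPLATE_ERROR = "QUERY SYNTAX ERROR, REQUIRED: [PRODUCT NAME], [PROPERTY]"


def parse_template_query(query):
    """Split '<product name>, <property>' into its two parts."""
    parts = [part.strip() for part in query.split(",")]
    if len(parts) != 2 or not all(parts):
        raise ValueError(TEMPLATE_ERROR)
    product, prop = parts
    return product, prop


print(parse_template_query("Apple iPhone 5s, price"))
```

Any input that does not contain exactly two non-empty, comma-separated parts is rejected with the syntax error message.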

Natural Language Based Queries

● In template based queries, we directly get the product and property names that the user is interested in. In NL queries, we need to extract them from the given sentence.
● This needs a preprocessing step, which involves removing stop words.
● The property name is very likely to come before the product name because of syntactic constraints in English (for example, "price of Apple iPhone 5s").
● Keeping this in mind, the property name and product name are extracted.
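The stop-word step can be sketched as follows. The stop-word set here is a small hypothetical sample, not the list the project used, and punctuation handling is limited to the question mark.

```python
# Hypothetical, deliberately small stop-word list for illustration.
STOP_WORDS = {"what", "is", "the", "of", "a", "an", "for", "does", "has", "which"}


def remove_stop_words(query):
    """Lowercase the query, drop '?' and remove stop words."""
    tokens = query.lower().replace("?", " ").split()
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("What is the price of Apple iPhone 5s?"))
# ['price', 'apple', 'iphone', '5s']
```

In the remaining tokens the property ("price") comes before the product name, which matches the ordering assumption described above.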

Approach

The approach followed:

● We maintain three lists: a product name list, a brand name list and a property list.
● We extract the brand of the product from the given query by comparing it against each entry in the brand name list using the edit distance algorithm, which also helps in handling spelling errors and typos.
● Elements of the product name list are tuples of size 3: brand, model name and category.
● After extracting the brand name, we consider only the products from that brand for further processing.
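The brand extraction and filtering steps above might look like the sketch below. The `max_distance` threshold, the single-word brand assumption, the function names and the sample brand and product entries are all hypothetical; the edit distance is the standard Levenshtein distance with unit costs.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping one row of the table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def extract_brand(tokens, brands, max_distance=2):
    """Return the brand closest to any query token, or None if none is close."""
    best = None
    for token in tokens:
        for brand in brands:
            distance = edit_distance(token, brand)
            if distance <= max_distance and (best is None or distance < best[0]):
                best = (distance, brand)
    return best[1] if best else None


# Illustrative entries: (brand, model name, category).
PRODUCTS = [
    ("samsung", "galaxy example", "mobiles"),
    ("apple", "iphone example", "mobiles"),
]
BRANDS = ["apple", "samsung", "sony"]

brand = extract_brand(["price", "samsng", "galaxy"], BRANDS)
candidates = [p for p in PRODUCTS if p[0] == brand]
print(brand, candidates)
```

The misspelt token "samsng" is one edit away from "samsung", so the typo is tolerated and only that brand's products remain as candidates.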

Approach continued:

● To determine the exact model name and the property name, we use a similar approach but combine the edit distance with a second similarity measure.
● The second similarity measure is the length of the longer of the two strings divided by the number of character matches between them.
● We take the harmonic mean of the edit-distance score and this second measure to get the final similarity measure.
● We take the top 10 results for products and the best one for the property.
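The combined measure can be sketched as below. This assumes the raw edit distance is used as the edit-distance score, so for both inputs and for their harmonic mean a lower value means a closer match; the source does not spell out this detail. The function names and candidate strings are hypothetical, and `collections.Counter` is used for the character matches instead of fixed-size count arrays.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance, keeping one row of the table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def length_per_match(a, b):
    """Longer length divided by the number of shared characters."""
    matches = sum((Counter(a) & Counter(b)).values())
    if matches == 0:
        return 99999.0  # nothing in common: treat as very dissimilar
    return max(len(a), len(b)) / float(matches)


def combined_score(a, b):
    """Harmonic mean of the two measures; lower means more similar."""
    d = float(edit_distance(a, b))
    r = length_per_match(a, b)  # always >= 1, so d + r is never zero
    return 2 * d * r / (d + r)


def top_matches(query, candidates, k=10):
    """Return the k candidates with the lowest combined score."""
    return sorted(candidates, key=lambda c: combined_score(query, c))[:k]


print(top_matches("iphone 5s", ["galaxy s5", "iphone 6", "iphone 5s"], k=2))
```

An exact match has edit distance 0, so its harmonic mean is 0 and it is ranked first.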

Similarity Measure - Edit Distance

    def editDistance(word1, word2):
        len_1 = len(word1)
        len_2 = len(word2)
        # x[i][j] is the edit distance between word1[:i] and word2[:j]
        x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
        for i in range(0, len_1 + 1):
            x[i][0] = i
        for j in range(0, len_2 + 1):
            x[0][j] = j
        for i in range(1, len_1 + 1):
            for j in range(1, len_2 + 1):
                if word1[i - 1] == word2[j - 1]:
                    x[i][j] = x[i - 1][j - 1]
                else:
                    x[i][j] = min(x[i][j - 1], x[i - 1][j], x[i - 1][j - 1]) + 1
        # Index explicitly instead of relying on leftover loop variables
        return x[len_1][len_2]

Similarity Measure - 2

    def similarityMetric(word1, word2):
        # Per-character counts; assumes 8-bit characters (ord(c) < 256)
        w1_c = [0] * 256
        w2_c = [0] * 256
        for i in word1:
            w1_c[ord(i)] += 1
        for i in word2:
            w2_c[ord(i)] += 1
        # Number of characters the two strings have in common
        matched_chars = 0
        for i in range(256):
            matched_chars += min(w1_c[i], w2_c[i])
        if matched_chars == 0:
            return 99999  # nothing in common: treat as very dissimilar
        return max(len(word1), len(word2)) / float(matched_chars)

Querying the Database

For every product name in the top 10, we query the database, obtain the corresponding results and display them to the user.
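A lookup over the shortlisted product names might be sketched like this. It assumes a collection object with a PyMongo-style `find_one(filter)` method; the field name `"model name"`, the helper `lookup_property` and the stand-in collection with its sample documents are hypothetical.

```python
def lookup_property(collection, model_names, prop, key_field="model name"):
    """Fetch one property for each shortlisted model that has it."""
    results = []
    for name in model_names:
        doc = collection.find_one({key_field: name})
        if doc is not None and prop in doc:
            results.append((name, doc[prop]))
    return results


class FakeCollection:
    """In-memory stand-in so the sketch runs without a database server."""

    def __init__(self, docs):
        self._docs = docs

    def find_one(self, query):
        for doc in self._docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None


# Illustrative documents only.
mobiles = FakeCollection([
    {"model name": "example x1", "price": "9999"},
    {"model name": "example x2"},
])
print(lookup_property(mobiles, ["example x1", "example x2", "missing"], "price"))
```

Models that are missing from the collection, or that lack the requested property, are skipped rather than reported as errors.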

Challenges

● Scraping the data requires sending a lot of requests to a server, so there is a chance that the server may temporarily block or blacklist our IP. Fortunately, we did not come across this problem.

● Initially we hand-picked a few specific properties, but as we expanded the domain we had to include all the available properties. This posed a problem when a query asked for a particular property whose data was stored under a synonymous or extended name. This can be handled using synsets (WordNet).
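The synonym problem described in this section can be illustrated with a small lookup that tries alternative names for a property. In this sketch a hand-written synonym table stands in for WordNet synsets, and the property names, the table and the helper `resolve_property` are all hypothetical.

```python
def resolve_property(requested, doc, synonyms):
    """Find the stored key that corresponds to the requested property."""
    # 1. Exact name, then any listed synonym.
    for name in [requested] + synonyms.get(requested, []):
        if name in doc:
            return name, doc[name]
    # 2. Extended names that contain the requested word.
    for key in doc:
        if requested in key.split():
            return key, doc[key]
    return None


# Hand-written table used in place of WordNet for this illustration.
SYNONYMS = {"price": ["cost"]}
doc = {"display size": "5 inch", "cost": "9999"}

print(resolve_property("price", doc, SYNONYMS))
print(resolve_property("display", doc, SYNONYMS))
```

The first call resolves through the synonym table and the second through the extended-name fallback; a WordNet-backed version would build the alternatives from synsets instead of a fixed table.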

Results

Thank You
