Dialog Engine for Product Information
Vamsee Chamakura - 201301243
Satyam Verma - 201505604
Jitta Divya Sai - 201225167
Problem Statement
The aim of the project is to implement a dialog system that answers users' queries about product information. The implementation has been divided into the following modules:
● Crawling and scraping the product information from Flipkart.
● Processing the scraped data.
● Saving the data in MongoDB (a NoSQL database).
● Preprocessing the query.
● Querying the database and extracting the relevant results.
System Design
[Diagram components: Flipkart, Crawler & Scraper, Database, MongoDB Driver, Query Processor, User's Query, Result]
Crawling and Scraping
● Scrapy and BeautifulSoup were used to crawl the data from Flipkart's website.
● The categories scraped were mobiles, televisions, laptops, air conditioners, refrigerators and cameras.
● Around 3000 products were extracted from these categories.
Processing the Scraped Data
● BeautifulSoup is a Python library for extracting data from HTML or XML pages.
● We used BeautifulSoup to extract all the properties from the crawled web pages.
● Different products had a varied number and type of properties, so we used MongoDB for flexibility.

    import requests
    from bs4 import BeautifulSoup

    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, "lxml")
        keys = soup.find_all("td", {"class": "specsKey"})
        vals = soup.find_all("td", {"class": "specsValue"})
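The specification names and values are collected as two parallel lists, so a natural next step is to pair them into one dictionary per product. The sketch below is only an illustration under that assumption: `build_spec_dict` is a hypothetical helper, not a function from the project, and it takes the cell text as plain strings.

```python
def build_spec_dict(key_texts, value_texts):
    """Pair specification names with their values (hypothetical helper)."""
    specs = {}
    for key, value in zip(key_texts, value_texts):
        # Collapse stray whitespace left over from the HTML layout.
        key = " ".join(key.split()).lower()
        value = " ".join(value.split())
        if key:  # skip cells that carry no property name
            specs[key] = value
    return specs


# Illustrative input only; real pages expose many more properties.
example = build_spec_dict(["Model Name ", "Display  Size"], ["iPhone 5s", "4 inch"])
```

With BeautifulSoup, the two input lists could be obtained with `[k.get_text() for k in keys]` and `[v.get_text() for v in vals]`.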
Storing the Data in MongoDB
MongoDB is a NoSQL database used for storing big data. The database has six collections:
Mobiles
Laptops
Television
Air Conditioner
Camera
Refrigerator
● MongoDB stores each record as a JSON-like document instead of a fixed row, which allows a lot of flexibility in storage.
● The properties of each product are stored as key-value pairs, with the model name as the primary key.
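One way to picture this storage layout is sketched below. It assumes the model name is placed in MongoDB's `_id` field (the field MongoDB uses as the primary key); the slides do not say how the key is actually stored. `make_document` and `save_product` are hypothetical names, and `collection` is expected to be a pymongo collection object such as `db["Mobiles"]`.

```python
def make_document(model_name, specs):
    """Build one product document, keyed by model name (assumed layout)."""
    doc = dict(specs)          # properties become top-level key-value pairs
    doc["_id"] = model_name    # assumption: model name serves as the _id
    return doc


def save_product(collection, doc):
    """Insert the document, or overwrite an older copy with the same _id."""
    collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```

Because documents in one collection need not share the same keys, a mobile with 40 properties and one with 25 can sit side by side without any schema change.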
Preprocessing the Query
We handled three types of queries:
● Template-based queries.
● Natural language (NL) based queries.
● Comparison-based (CB) queries.
Examples:
● Template for queries: QUERY SYNTAX ERROR, REQUIRED: [PRODUCT NAME], [PROPERTY]
● NL queries: What is the price of Apple iPhone 5s?
● CB queries: Which among apple iphone, samsung galaxy has the best price?
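A minimal sketch of how a template query might be checked is given here. It assumes that a template query is simply the product name and the property separated by a comma, and that the syntax-error text is what the user sees when the input does not fit; neither detail is confirmed by the slides. `parse_template_query` is a hypothetical name.

```python
ERROR_MESSAGE = "QUERY SYNTAX ERROR, REQUIRED: [PRODUCT NAME], [PROPERTY]"


def parse_template_query(query):
    """Return ((product, property), None) on success, (None, error) otherwise."""
    parts = [part.strip() for part in query.split(",")]
    # Exactly two non-empty fields are expected: product name, then property.
    if len(parts) != 2 or not all(parts):
        return None, ERROR_MESSAGE
    product, prop = parts
    return (product, prop), None
```

For example, `parse_template_query("Apple iPhone 5s, price")` yields the pair `("Apple iPhone 5s", "price")`, while a query without a comma yields the error message.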
Natural Language Based Queries
In template-based queries, we directly get the product and property names that the user is interested in. In NL queries, we need to extract them from the given sentence. This needs a preprocessing step, which involves removing stop words. Due to syntactic constraints in the English language, the property name very likely comes before the product name (as in "the price of Apple iPhone 5s"). The property name and product name are extracted with this ordering in mind.
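The preprocessing step can be sketched roughly as follows. The stop-word set is a small hand-written stand-in (the slides do not say which list was used), and `split_property_product` is a deliberately naive illustration of the "property first" heuristic; the project matches the words against its own lists instead of splitting by position.

```python
import re

# Assumed stop words, for illustration only.
STOP_WORDS = {"what", "is", "the", "of", "a", "an", "does", "have", "which", "has"}


def content_words(query):
    """Lowercase the query, tokenize it and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def split_property_product(words):
    """Naive split: first content word is the property, the rest the product."""
    if not words:
        return "", ""
    return words[0], " ".join(words[1:])


words = content_words("What is the price of Apple iPhone 5s?")
guess = split_property_product(words)
```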
Approach
● We maintain three lists: a product name list, a brand name list and a property list.
● Elements of the product name list are tuples of size 3: brand, model name and category.
● We extract the brand of the product from the given query by comparing the query against every entry in the brand name list using the edit distance algorithm, which also helps in handling spelling errors and typos.
● After extracting the brand name, we consider only the products of that brand for further processing.
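The brand-extraction step might look like the sketch below. The threshold `max_distance=2` and the name `find_brand` are assumptions made for illustration; the slides do not state how close a word must be to count as a brand. The edit distance here is a compact row-by-row Levenshtein implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def find_brand(query_words, brands, max_distance=2):
    """Return the brand closest to any query word, or None if none is close."""
    best = None
    for word in query_words:
        for brand in brands:
            distance = edit_distance(word.lower(), brand.lower())
            if distance <= max_distance and (best is None or distance < best[0]):
                best = (distance, brand)
    return best[1] if best else None
```

A misspelled query word such as "samsng" is one edit away from "samsung", so it still resolves to the right brand.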
Approach (continued)
● To determine the exact model name and the property name, we use a similar approach but add a second similarity measure alongside the edit distance.
● The second similarity measure is calculated by dividing the length of the longer string by the number of characters the two strings have in common.
● We take the harmonic mean of the edit-distance score and the above metric to get a final similarity measure. For both measures, a lower value means a closer match.
● We take the top 10 results for products and the best one for property.
Similarity Measure - Edit Distance

    def editDistance(word1, word2):
        len_1 = len(word1)
        len_2 = len(word2)
        # x[i][j] holds the edit distance between word1[:i] and word2[:j]
        x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
        for i in range(0, len_1 + 1):
            x[i][0] = i
        for j in range(0, len_2 + 1):
            x[0][j] = j
        for i in range(1, len_1 + 1):
            for j in range(1, len_2 + 1):
                if word1[i-1] == word2[j-1]:
                    x[i][j] = x[i-1][j-1]
                else:
                    x[i][j] = min(x[i][j-1], x[i-1][j], x[i-1][j-1]) + 1
        return x[len_1][len_2]
Similarity Measure - 2

    def similarityMetric(word1, word2):
        # Character counts indexed by character code (assumes codes below 256)
        w1_c = [0] * 256
        w2_c = [0] * 256
        for i in word1:
            w1_c[ord(i)] += 1
        for i in word2:
            w2_c[ord(i)] += 1
        # Number of characters the two words share, counting repeats
        matched_words = 0
        for i in range(256):
            matched_words += min(w1_c[i], w2_c[i])
        if matched_words == 0:
            return 99999
        return max(len(word1), len(word2)) / float(matched_words)
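Combining the two measures through a harmonic mean and keeping the best candidates could be sketched as below. This is an illustration, not the project's code: the helper names are hypothetical, the two measures are re-implemented compactly so the block runs on its own, and no normalization is applied because the slides do not mention any.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def overlap_ratio(a, b):
    """Length of the longer string divided by the number of shared characters."""
    matched = sum((Counter(a) & Counter(b)).values())
    if matched == 0:
        return 99999.0  # same sentinel as the second similarity measure
    return max(len(a), len(b)) / matched


def combined_score(a, b):
    """Harmonic mean of the two measures; lower means more similar."""
    d = edit_distance(a, b)
    r = overlap_ratio(a, b)  # always at least 1, so d + r is never zero
    return 2 * d * r / (d + r)


def top_matches(query, candidates, k=10):
    """Return the k candidates with the lowest combined score."""
    return sorted(candidates,
                  key=lambda c: combined_score(query.lower(), c.lower()))[:k]
```

An exact match has edit distance 0, so its harmonic mean is 0 and it always ranks first.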
Querying the Database
For every product name in the top 10, we query the database, obtain the respective results and display them to the user.
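The lookup for each candidate product can be sketched as follows, assuming again that the model name is stored in the `_id` field and that `collection` is a pymongo collection. `lookup_property` and `answer_query` are hypothetical names; `find_one` with a filter and a projection is the standard pymongo call for fetching a single document.

```python
def lookup_property(collection, model_name, property_name):
    """Fetch one property of one product, or None if either is missing."""
    # The projection limits the returned document to the requested property.
    doc = collection.find_one({"_id": model_name}, {property_name: 1})
    if doc is None:
        return None
    return doc.get(property_name)


def answer_query(collection, candidate_models, property_name):
    """Collect the property value for every candidate that has one."""
    answers = {}
    for model in candidate_models:
        value = lookup_property(collection, model, property_name)
        if value is not None:
            answers[model] = value
    return answers
```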
Challenges
● Since scraping the data requires sending a lot of requests to a server, there is a chance that the server may temporarily block or blacklist our IP. Fortunately, we did not come across this problem.
● Initially we hand-picked a few specific properties, but as we expanded the domain we had to include all the available properties. This posed a problem when answering queries that asked for a particular property whose data was stored under a synonymous or extended name. This can be handled using synsets (WordNet).
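The synonym problem can be illustrated with the sketch below. It uses a small hand-written synonym table as a stand-in for WordNet synsets, so the groups shown are assumptions and not real WordNet output; `resolve_property` is a hypothetical name. Extended names are handled by a simple substring check.

```python
# Hand-written stand-in for WordNet synsets (illustrative only).
SYNONYM_GROUPS = [
    {"price", "cost", "selling price"},
    {"weight", "mass"},
]


def resolve_property(requested, stored_keys, groups=SYNONYM_GROUPS):
    """Map a requested property name onto a key that is actually stored."""
    requested = requested.lower()
    names = {requested}
    for group in groups:
        if requested in group:
            names |= group  # accept every synonym of the requested name
    for key in stored_keys:
        if key.lower() in names:
            return key
    # Fall back to extended names, e.g. "size" stored as "Display Size".
    for key in stored_keys:
        if requested in key.lower():
            return key
    return None
```

A query for "cost" would then find data stored under "Selling Price", and a query for "size" would find "Display Size".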
Results
Thank You