Northern Light SinglePoint Market Research Portal
Overview White Paper
September, 2003
One Broadway, 14th Floor, Cambridge MA 02142
617-242-5960
Copyright 2003, Northern Light Group, LLC, All Rights Reserved
Table of Contents
Background
SinglePoint Market Research Portal Overview
Custom Content Integration
Search Technology Architecture History
Search Technology Architecture Overview
Documents
Metadata and Metatags
Data Model
Data Collection and Web Crawling
Search Service Description
Query Database
Automatic Classification
Northern Light Taxonomy
Sample Taxonomy: Aerospace
Sample Taxonomy: Construction
Indexing
Query Service
Searching
Relevancy Ranking
Custom Search Folders
Session Management
Security
Alerts
Applications Development and Hosting
Northern Light Technology Awards
Background
Northern Light was founded in 1996 with the objectives to: (i) unify all of the best content in the
world into one database, (ii) build search technology that allows searchers to easily find the most
relevant (not just the most) information within that database, (iii) build search technology that
works for both the first time untrained user and the information professional, and (iv) create a set
of tools to allow Northern Light to build and operate custom information solutions for businesses
that utilize these capabilities. This mission was formed from the following observations:
Most interesting questions have relevant content from many sources, i.e., the Web, news
feeds, licensed research, journal archives, and internal corporate information.
Given the penetration of Internet technology on corporate networks, there is no longer any
technical or distribution barrier to making all digital information available from any desktop
computer.
Unstructured information is overwhelmingly the most common type. It is impossible in any
large search application to know what the organization, metatags, or document structure will
be in advance.
People in all walks of life are search engine literate and use search engines to access
information on a daily basis. There is no other corporate application that requires less
training or support than a search application.
In particular, the key problem for search engine information retrieval is that of producing a
precise set of relevant documents: fewer good documents, not more useless ones.
To accomplish its mission, Northern Light set out to meet the following goals:
Build a continually growing stable of content integration technology that would allow Northern
Light to create databases of greatly diverse content,
Make use of the best existing technology and develop new technology for highly scalable and
precise searching,
Develop highly scalable automated classification and related technologies to use pre- and
post-search because even the best query interpretation and relevancy ranking are frequently
inadequate to answer an information need expressed in one or a few words against a
database of a billion documents or more, and
Become extraordinarily proficient at dealing with unstructured data, bringing accessibility,
usability, organization, and classification to arbitrarily large and diverse content sets from
thousands of sources.
NorthernLight.com formally launched in August 1997 and was the first Internet search engine to
offer access to Web, published, and internal corporate content in a single database.
SinglePoint Market Research Portal Overview
Most large companies license studies, analysis, and commentary from dozens of market research
firms, trade journals, periodicals, equity analyst investment reports, and newswires. When there
is the need to learn about a market or a product or a competitor, searching all the licensed
sources is a major problem. How does the researcher, who may be a marketing, sales, or
product development professional rather than a librarian, search dozens of market research and
other sources? Does he or she log in dozens of times with dozens of different user names and
passwords? Learn dozens of user interfaces on dozens of different Web sites? Run dozens of
searches? Manually collate dozens of results lists?
Another challenge confronting organizations that license content from many sources is the
utilization of that content. Accustomed to Internet search engines like Google, most of us are
very impatient with search processes. As end-users, we know that it is technically possible to
have all the relevant information in a single database. Individuals will simply not persevere in
searching several sources. The most popular and obvious one will be searched, then maybe one
more; on rare occasions, three sources might be searched. The fourth through the hundredth
licensed services are rarely consulted.
Northern Light SinglePoint offers a more efficient and effective way to use all of a company's
licensed content: one login, one search, and one integrated results list from all licensed sources.
Companies can even include search of internal market research and studies so users can access
relevant material produced within the organization while simultaneously looking at external
sources. Using SinglePoint, it is also possible to include Web content in the database. (Is any
search today complete without consulting relevant Web sources?) For example, vertical search
of competitors' websites, trade association websites, and government regulatory agency websites
is highly relevant to many marketing and product planning decisions.
The outcome: complete research from many sources, in the same time it now takes to search just
one source. How much is the time of every professional employee in an organization worth?
Also, the utilization of licensed content soars when a SinglePoint application is deployed.
Sources are used based on their relevance, not on how familiar the staff is with them.
The salient features of SinglePoint include:
Content: All licensed, internal and Internet content from any source, in any format, from any
organization anywhere in the world. Whatever the sources are, we can put them into one
integrated index.
Organization: All sources indexed and subject classified to a consistent standard
User interface:
- One login
- One user interface
- One search
- One results list
Seat management: Enforce access privileges by user or group to sources, groups of
sources or individual documents.
Outsourced turnkey solution: Northern Light develops, hosts and maintains the
SinglePoint to help keep overhead down and minimize impact on your internal IT resources.
Security: Private database and private Web server via secure API, VPN or T1. User names
and passwords can be used, as well as IP validation. Security can be customized to meet specific
client requirements, and integration with corporate network security systems is available.
Custom Content Integration
The heart of a SinglePoint application is custom content integration. It is not unusual to integrate
dozens, even scores, of sources into a SinglePoint market research portal.
The integration normally involves these steps:
Determine the licensing arrangements of our customer with each of the sources. Northern
Light has to understand whether the content is available on an enterprise-wide flat rate or is
limited to a certain number of seats or limited in some other way. Most research vendors
offer a wide variety of options in their offerings, and Northern Light has to understand which
options have been selected by our customer.
Work with the market research vendor to understand the structure of the vendor's content
and network, and to determine how to acquire it to build our customer's index. Vendors today most
often use an HTTP-based extranet that we can crawl (with the vendor's help getting
through the firewall). Alternatively, FTP file transfers are a common technique for acquiring
the content. Northern Light will then set up automated processes to acquire exactly what our
customer has licensed from each vendor.
Write filters to convert the vendor's content to the Northern Light load format. This involves
determining how to capture the vendor's metadata (or metatags) so that the metadata can be
included in the index (and hence available to select, sort, and filter results).
Determine the login required for the vendor to fulfill document requests, and set up our
transaction system to be able to transmit documents from the vendor site to the browsers of
the end-users of our customer.
Determine if any internal content is intended for the database. This can be internal market
studies, MS PowerPoint presentations, or locally held copies of licensed market research. If
internal content is to be included, crawling or FTP file transfer procedures must be arranged.
For small volumes of internal content, Northern Light's Automated Submission and Publishing
(ASAP) system can be used to easily publish documents to the database.
Determine if any Web content is desired for the database. Popular choices include
competitors' websites, trade association websites, and government regulatory agency
websites. Subscription websites and e-journals can be included in the database, e.g., trade
journal sites or industry publication sites.
Index and classify the content, creating the comprehensive multi-vendor index for our
customer to search.
Load the content every day, create the index, and serve the end-user queries. Northern Light
disposes of our copy of the vendor's content (or of internal content) a few days after the load
process is complete. (We do maintain a copy for a few days so we can re-index a recent
load if there turns out to be a problem of any sort.) Note that a SinglePoint database index is
unique to a specific customer, facilitating security and usability.
Once the comprehensive multi-vendor/multi-source database index has been built, end-users
may query from UIs provided on their intranet or from a UI we can provide as a private website.
Results returned are from all the content in the database across all of the licensed vendors,
consistently relevance ranked and indexed. When an end-user wants a document, he or she
clicks on the link, just as with any other search engine. We then instantly, transparently, and
automatically authenticate that the end-user has rights to the document, log the end-user in to
the service of the vendor in question, fetch the document, and put the document in the browser
window of the end-user. The transaction system of the vendor records the event as if the
end-user had logged in.
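The authenticate-then-fetch sequence described above can be sketched roughly as follows. This is a minimal illustration, not Northern Light's transaction system: the `Entitlements` class, the `fetch_document` function, the user and vendor names, and the example URL are all hypothetical, and the vendor login is reduced to a stand-in callable.

```python
from dataclasses import dataclass, field


@dataclass
class Entitlements:
    """Hypothetical store of access rights granted by user or by group."""
    user_sources: dict = field(default_factory=dict)   # user -> set of sources
    group_sources: dict = field(default_factory=dict)  # group -> set of sources
    user_groups: dict = field(default_factory=dict)    # user -> set of groups

    def allowed(self, user, source):
        """True if the user, or any group the user belongs to, may read the source."""
        if source in self.user_sources.get(user, set()):
            return True
        return any(
            source in self.group_sources.get(group, set())
            for group in self.user_groups.get(user, set())
        )


def fetch_document(user, hit, entitlements, vendor_fetchers):
    """Check rights first, then fetch from the vendor on the end-user's behalf."""
    source = hit["source"]
    if not entitlements.allowed(user, source):
        raise PermissionError(f"{user} is not licensed for {source}")
    # The fetcher stands in for the vendor login and document request.
    return vendor_fetchers[source](hit["url"])


# Assumed example data: one group licensed for one vendor.
entitlements = Entitlements(
    group_sources={"marketing": {"vendor_a"}},
    user_groups={"pat": {"marketing"}},
)
fetchers = {"vendor_a": lambda url: f"<document from {url}>"}
hit = {"source": "vendor_a", "url": "https://vendor-a.example/report/1"}
document = fetch_document("pat", hit, entitlements, fetchers)
```

Checking rights before contacting the vendor means an unlicensed request never reaches the vendor's service, so the vendor's transaction records would reflect only permitted accesses.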
Below is a diagram of the content integration process.
[Figure: content integration process. Labels recoverable from the diagram: Northern Light; Customer; User Interface; Integrated Index; Crawler; Trash; Internal Repository; Internal Reports; Mkt. Research; Investment reports; Industry Websites; Newswires & journals; Full-text content; Computer generated data.]
Search Technology Architecture History
Northern Light's initial strategy was to provide a service to site visitors that would generate
revenue from advertising and from sales of a licensed business information library, known as the
Special Collection, of over 7,000 journal and periodical sources. In order to offer this service,
Northern Light had to master the Web and hundreds of publisher formats, and integrate all of the
content into one database, normalize it from an indexing and classification viewpoint, and search
it all with one query. The total database approached 400 million documents and it still may be the
largest business research library ever assembled.
The service (the search engine then available at www.NorthernLight.com) was warmly received
by existing consumers of high-quality, fee-based information: professionals in corporations,
educational institutions and governments. As a result, a marketing effort was directed toward the
enterprise market through high volume sales to organizations.
Northern Light's unique focus on both Web and non-Web data, coupled with its ability to sell
non-Web documents, made it of interest to organizations that wanted to be directly involved in offering
their own data or services. As a result, the initial strategy was combined with a gateway
partnership component that typically paired Northern Light with other organizations in the offering
of co-branded, specialized, Internet-based search sites.
Northern Light's enterprise strategy now includes a suite of customized search solutions based
on the Northern Light technology platform that creates value for a large range of organizations
and situations. These solutions include:
Custom intranet portals that feature search of customized content sets of licensed, internal,
and Web sources
Search-based services for extranets,
Hosting and sale of archival documents for publishers,
Search-of-site for Web-based services, and
Full-scale custom information products (e.g., for the U.S. government), etc.
These information management products reflect the company's core competencies in search
technology, classification and taxonomy development, and integration of diverse content and
federated search. Northern Light typically offers these services on an outsourced, ASP basis.
However, the Northern Light search technology is also available as licensed software for in-house
customer use on Solaris and Linux platforms.
All of Northern Light's solutions derive from the original vision expressed in our Web search
engine of one database that could provide access to all the world's useful information. All of our
solutions share these characteristics:
Scale to gigantic numbers of documents.
Speed to efficiently search such large databases.
Precision, or relevance ranking, to make the large databases useful.
Classification to automatically organize the body of unstructured content in useful ways.
Search Technology Architecture Overview
Documents
The basic unit of the Northern Light database is a document. Documents are most commonly text
based, although they appear to the database simply as objects with a URL, so they could be any
media. Each document is viewed as a multi-dimensional object that may have one or more
values for a number of different fields, attributes, dimensions and/or domains (terms used
interchangeably).
One such special field is the display object: the viewable document itself, generally stored in the
form it was delivered to Northern Light. The display object is generally not retained by Northern
Light unless there is an arrangement to do so. Most often, the display object resides where it was
crawled, and the URL in the Northern Light database points to it there.
Metadata and Metatags
The other fields, or the values for those fields (title, author, creation date, subject, source, etc.), are
generally referred to as metadata: data about the document. The term metadata sometimes
applies to the actual values a document may have for each of these fields, as well as to the
named fields themselves. Certain metadata is required for any document, including a display
object, title, etc. Metadata is also used for Custom Search Folders and other classification-based
browsing and searching, and includes key attributes such as subject, type, source, language and
region.
This metadata is sometimes present in the document itself (in which case the metadata might be
called metatags) and sometimes it is generated by the Northern Light technology's
auto-classification capabilities. Unique metadata of custom data types, e.g., internal documents, may
be captured by the content loading filter so that they will always be present for use in
classification and searching.
A document can have multiple values for a single field, e.g., multiple subjects or multiple authors.
All metadata is represented in the database index, which makes it available for filtering,
organizing, and sorting at query time.
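As a rough illustration of a document whose fields may each hold several values, and of filtering and sorting on that metadata at query time, consider the sketch below. The `Document` class, the helper functions, the field names, and the sample records are assumptions made for the example; they do not describe the actual Northern Light load format or index.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document known to the database by its URL plus multi-valued metadata."""
    url: str
    title: str
    metadata: dict = field(default_factory=dict)  # field name -> list of values

    def values(self, name):
        """All values this document has for a field (empty list if none)."""
        return self.metadata.get(name, [])


def filter_by(documents, name, value):
    """Keep documents that have the given value among their values for a field."""
    return [d for d in documents if value in d.values(name)]


def sort_by(documents, name):
    """Sort by the first value of a field; documents lacking the field go last."""
    return sorted(documents, key=lambda d: (not d.values(name), d.values(name)[:1]))


docs = [
    Document("https://example.com/a", "Router market forecast",
             {"subject": ["networking", "market forecasts"],
              "author": ["A. Analyst", "B. Analyst"],
              "date": ["2003-06-01"]}),
    Document("https://example.com/b", "Switch vendor profile",
             {"subject": ["networking"], "date": ["2003-02-15"]}),
    Document("https://example.com/c", "Untitled memo"),
]

forecasts = filter_by(docs, "subject", "market forecasts")
by_date = sort_by(docs, "date")
```

Storing every field as a list keeps single-valued and multi-valued fields uniform, so the same filter works for one author or several.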
Data Model
The standard metadata used for Custom Search Folders (subject, type, source, language and
region) is treated specially in a number of ways. The possible values for each of these fields
have been defined and comprise a taxonomy or set of possible values for that field/domain.
These taxonomies are all hierarchical, though they contain many cross-references; a given value
may have more than one parent because each taxonomy is actually a directed acyclic graph.
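A taxonomy in which a value can have more than one parent can be sketched as a mapping from each node to its list of parents. The node names and parent assignments below are illustrative assumptions, not an excerpt of the actual taxonomy; the point is only that collecting ancestors must follow every parent and visit shared ancestors once.

```python
def ancestors(node, parents):
    """Return every node reachable by following parent links upward."""
    found = set()
    pending = [node]
    while pending:
        current = pending.pop()
        for parent in parents.get(current, ()):
            if parent not in found:  # a shared ancestor is visited only once
                found.add(parent)
                pending.append(parent)
    return found


# Hypothetical fragment: one node sits under two different branches.
parents = {
    "technology": [],
    "humanities": [],
    "aviation & space technology": ["technology"],
    "history": ["humanities"],
    "history of aviation": ["aviation & space technology", "history"],
}

lineage = ancestors("history of aviation", parents)
```

Because the graph is acyclic, the walk always terminates, and a document classified to the multi-parent node could be found by browsing down from either top-level branch.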
The subject taxonomy contains approximately 17,000 values (referred to as nodes), starting at
the top level with broad categories such as "humanities," and sometimes going eight or more
levels deep in certain areas to provide very specific subject values such as "works of W.H.
Auden" or "robotics."
The type field refers to the kind of document: an article (the default and most populated type),
a review (with more specific typing of book review and others), an editorial, a letter, a report,
something for sale, etc.
The source field refers to where the document came from, and is either a Web source of some
kind (e.g., a Web site, or possibly a higher-level source node such as all commercial sites) or a
Special Collection source, typically a single journal or book title at the lowest level (e.g., The
Economist, or the Boston Herald) or, again, a higher-level aggregate (e.g., journals and
magazines, news articles, etc.).
The language field is the predominant language(s) of the document, currently one of English,
French, German, Spanish, Italian and unknown (i.e., some other language).
The region field specifies a location or locations referenced in the document: a city, country,
geographic region, etc.
For type, language and subject, the metadata value(s) attempt to capture what the document is
really about (or substantially written in, in the case of language). Multiple values are possible but
these are intended to represent true multi-subject documents. In the case of region, the model is
slightly different; a document will be tagged with any and all regions that can possibly be
identified with a document. The difference between region and these other fields is the way they
are used in searching.
This multi-dimensional model is in contrast to a single dimensional model that must rely on
repetition within a single domain in order to achieve comprehensive document descriptions. For
example, in a single-dimensional design, values like reviews or biographies could be repeated
under all or a very large number of subject values. As the amount of metadata increases, the
single (or few) domain model becomes increasingly complex and unwieldy. The Northern Light
multi-dimensional model, however, can maintain multiple taxonomies easily and simply and can
class and organize documents against them.
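The growth the paragraph describes can be made concrete with a little arithmetic. In the worst case for a single-domain design, every value of one field is repeated under every value of another, so the node count is a product; with separate taxonomies it is a sum. The figure of roughly 17,000 subject values comes from the text above, while the count of 10 document types is an assumption chosen only for illustration.

```python
from math import prod


def single_dimension_size(*domain_sizes):
    """Worst case: every combination of values becomes its own node."""
    return prod(domain_sizes)


def multi_dimension_size(*domain_sizes):
    """Each domain keeps its own taxonomy, so sizes simply add up."""
    return sum(domain_sizes)


subjects = 17_000   # approximate subject taxonomy size stated in the text
types = 10          # assumed number of document types, for illustration only

combined = single_dimension_size(subjects, types)
separate = multi_dimension_size(subjects, types)
```

Under these assumptions the single-domain design would need 170,000 nodes against 17,010 for separate taxonomies, and the gap widens with each additional field.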
Data Collection and Web Crawling
Using SinglePoint, any content in any format located on any computer anywhere in the world in
the possession of any organization can be put into the database. From a technical viewpoint,
data flows into the database via crawling (if the content is on an http platform such as the Web or
an intranet), licensed feeds, or by FTP file transfer.
Data is converted to a standard Northern Light format that captures the document itself plus
associated metadata, including title, date, and anything else that the customer wants to have
captured, such as document type. Since data typically arrives in non-HTML format, part of this
conversion involves changing the document text itself (often in tagged ASCII, SGML or other
formats) into HTML. Northern Light has, to date, converted over 200 different data formats,
including the older tif-wrapped PDF and documents rendered as images. (Images are
processed with programmatic OCR to make the content available for indexing and full text
search.)
In the case of certain third-party content licensed specifically for one or more SinglePoint
implementations (such as market research content from vendors such as Gartner, IDC or
Forrester), Northern Light keeps the content only as long as it takes to create the necessary
indexes. Once the content has been completely indexed, it is discarded, making it impossible for
Northern Light to re-create the full-text of these content sources.
Search Service Description
Query Database
Once data is placed in a standard Northern Light format, it is loaded into a Northern Light
database. Loading may be a misleading term because the content does not actually reside in the
Northern Light database. Loading refers to subject classifying the documents and indexing
them. For SinglePoint applications, the index of classified documents is unique to an individual
corporate customer. End-users send queries to the index and the results lists are generated
from the index. The content itself is not touched during querying; indeed, for many applications
Northern Light actually disposes of its copy of the content after it is loaded, remembering in the
index, of course, where we got it from so that the document can be retrieved if an end-user
desires to read it after selecting it from a results list.
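The idea that queries run against an index while the content itself is discarded can be sketched with a small inverted index that keeps only term postings and each document's location. This is a simplified illustration under assumed names (`Index`, `load`, `search`) and an assumed whitespace tokenizer; it is not the Northern Light index format.

```python
from collections import defaultdict


class Index:
    """Keeps term postings and document locations, never the full text."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.locations = {}               # document id -> URL it was acquired from

    def load(self, doc_id, url, text):
        """Index the text, remember where it came from, and keep nothing else."""
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.locations[doc_id] = url
        # The text goes out of scope here; only the index entries remain.

    def search(self, query):
        """Return the URLs of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(self.locations[doc_id] for doc_id in matches)


index = Index()
index.load(1, "https://vendor.example/r1", "Router market forecast")
index.load(2, "https://vendor.example/r2", "Router vendor profile")
hits = index.search("router forecast")
```

A results list built this way carries only locations, so reading a document still requires going back to wherever it resides.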
Automatic Classification
To deliver automated classification against a huge and heterogeneous data set, the Northern
Light technology uses our own classification taxonomies for subject, type (e.g., article, review,
FAQ, job listing, etc.) and other document attributes, drawing on existing taxonomies and
supplementing them to provide comprehensive coverage for a wide range of users. An automatic
system has also been built that uses multiple strategies (e.g., pattern extraction from training
documents, co-location analysis, and structural elements) for classifying documents for a given
attribute. Both the taxonomies and the automated system have been in production and supporting
end users since August, 1997, have classified over a billion documents, and are continually being
refined to deliver more comprehensive and precise classification and better operational
performance.
At this point, Northern Light's automatic classification is still the only system to ever automatically
subject classify the World Wide Web. Performance levels have been achieved by fully divorcing
the logical classification models from their practical implementation and creating data structures
appropriate for rapid classification of documents against the large but relatively stable
taxonomies, patterns, and rules that are the basis of the classification process.
One primary use of classification information (i.e., metadata) at Northern Light today is
to organize the results of a search (through Custom Search Folders) by appropriate attribute
values. This facilitates rapid navigation and some level of automatic query refinement, while
allowing more expert users to limit their search initially by some appropriate attribute value.
Metadata is also used as one factor (among many) in relevancy ranking.
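Organizing a results list by attribute value, and limiting a search by one value, might look like the sketch below. The result records, attribute names, and the `build_folders` and `narrow` helpers are hypothetical; how Custom Search Folders are actually selected and ordered is not described here and is not modeled.

```python
from collections import defaultdict


def build_folders(results, attribute):
    """Group result titles under each value they carry for the attribute."""
    folders = defaultdict(list)
    for doc in results:
        for value in doc.get(attribute, []):
            folders[value].append(doc["title"])
    return dict(folders)


def narrow(results, attribute, value):
    """Limit a result list to documents carrying one attribute value."""
    return [doc for doc in results if value in doc.get(attribute, [])]


# Assumed sample results with multi-valued metadata.
results = [
    {"title": "Jet engine outlook", "subject": ["Jet engines"], "type": ["report"]},
    {"title": "Review of propulsion handbook",
     "subject": ["Jet engines", "Rocket engines"], "type": ["review"]},
    {"title": "Launch market news", "subject": ["Rocket engines"], "type": ["article"]},
]

subject_folders = build_folders(results, "subject")
reviews_only = narrow(results, "type", "review")
```

A document with several values for an attribute appears in several folders, which matches the multi-valued metadata model.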
Northern Light Taxonomy
Subject classification has been designed to classify a document to the one or a small number of
subjects from our subject taxonomy of more than 17,000 terms that a document is truly about (vs. classifying
to all subjects that occur in the document). The system can today subject classify approximately
25% of random Web documents (about what human editors are able to do) at accuracy rates of
90-95%, as measured by user/customer appraisals. These coverage and accuracy rates are
significantly better for non-Web documents. Classification coverage and accuracy have been
realized by continually engineering and extending both known and novel technologies in light of
specifically identified problems.
Below are two examples of taxonomy branches: one for aerospace technology and industry, the
other for construction technology and industry.
Sample Taxonomy: Aerospace
The node identifier is the ID#, and the @ sign indicates inclusion by reference of other branches
of the taxonomy.
Aviation & space technology ID#18340
Aerodynamics ID#18341
Aeronautics ID#18342
Flight control & navigation ID#18368
Aeronomy ID#39560
Aerospace communications equipment ID#18367
Aerospace materials ID#18344
Air traffic control ID#38332
Aircraft design & construction ID#18346
Aircraft engines & motors ID#18356
Aerospace propulsion ID#18395
Jet engines ID#18357
Rocket engines ID#18359
Aviation instrumentation ID#18382
Commercial aircraft design ID#18347
Flight simulators ID#17478
Flight testing ID#18369
Gliders ID#18351
Helicopters ID#18352
Homebuilts & ultralights ID#18349
Hot air balloons ID#18350
Landing gear ID#18383
Military aircraft ID#18353
Seaplanes ID#18354
Small planes ID#18355
Astronautics ID#18363
Space systems ID#18412
Astrophysics ID#18366
@Celestial mechanics (ID#14437) ID#13928
Aviation ground facilities ID#18372
Airport planning & design ID#18373
Military aircraft ground facilities ID#18375
Spacecraft ground facilities ID#18377
Avionics ID#37848
Civil aviation ID#37768
@Flight control & navigation (ID#14428) ID#18368
History of aviation & space technology ID#18378
History of aviation ID#18379
History of space flight ID#18380
@National Aeronautics & Space Administration (NASA) (ID#14427) ID#10135
@Remote sensing (ID#14429) ID#13629
Satellite technology ID#18402
Communications satellites ID#18404
Space stations ID#18411
MIR space station ID#36642
Space travel & exploration ID#18413
Space colonization ID#18405
Spacecraft & Space missions ID#18406
Apollo space missions ID#18407
Gemini space mission ID#18408
Manned spacecraft ID#39606
Mercury space missions ID#18409
Space Shuttle ID#18410
Space launch vehicles & equipment ID#39607
Space probes ID#18389
Space safety ID#39608
Unmanned spacecraft ID#39609
Viking mission to Mars ID#29129
@Telescopes (ID#14432) ID#13198
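The entry notation in the listing above can be read mechanically. The sketch below parses one entry per line, on the assumption that for an @ entry the parenthesized ID identifies the reference itself and the trailing ID identifies the referenced node; that reading is suggested by "Flight control & navigation" carrying ID#18368 both as a plain node and as the trailing ID of its @ entry. The `Entry` type and `parse_entry` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    is_reference: bool
    entry_id: Optional[int]   # the entry's own ID
    target_id: Optional[int]  # for @ entries: ID of the node included by reference


# Optional "@", a name, an optional "(ID#n)", and an optional trailing "ID#n".
LINE = re.compile(
    r"^(?P<ref>@)?(?P<name>.+?)"
    r"(?: \(ID#(?P<paren>\d+)\))?"
    r"(?: ID#(?P<trailing>\d+))?$"
)


def parse_entry(line):
    """Parse one taxonomy line such as 'Aerodynamics ID#18341'."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized taxonomy line: {line!r}")
    paren = int(match["paren"]) if match["paren"] else None
    trailing = int(match["trailing"]) if match["trailing"] else None
    if match["ref"]:
        return Entry(match["name"], True, paren, trailing)
    return Entry(match["name"], False, trailing, None)


node = parse_entry("Aerodynamics ID#18341")
reference = parse_entry("@Flight control & navigation (ID#14428) ID#18368")
```

Names that themselves contain parentheses, such as the NASA entry, still parse because only a parenthesized group beginning with "ID#" is treated as an identifier.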
Sample Taxonomy: Construction
Architectural engineering ID#18323
@Architectural design (ID#40386) ID#358
Building acoustics ID#18324
@Construction engineering (ID#14424) ID#18511
@Construction management (ID#14425) ID#18512
Heating, ventilation & air conditioning ID#18331
@Air conditioners & fans (ID#41168) ID#14698
@Home furnaces (ID#43116) ID#37672
Lighting & electrical systems ID#18333
Commercial lighting ID#14840
Exterior lighting ID#14767
@Lighting design (ID#42055)
@Structural engineering (ID#14426)
Architectural services ID#4574
Architectural drafting ID#4576
House plans ID#37844
@Landscape architecture (ID#40544)
Lighting design ID#27266
Asbestos ID#5305
@Asbestos exposure (ID#40679)
@Asbestos removal (ID#40570)
Civil engineering ID#18509
Bridge engineering ID#18510
Construction engineering ID#18511
Building standards & codes ID#39565
Construction automation ID#18513
Construction management ID#18512
Construction safety ID#6194
@Dams, canals & waterways (ID#14473)
Earthworks engineering ID#18518
Fire technology ID#28435
Combustion & flammability ID#28439
Fire investigation ID#28448
Fire prevention ID#28450
Fire safety systems ID#19295
Fire suppression ID#28441
Geotechnical engineering ID#18520
Earthquake engineering ID#39468
Geo-environmental systems ID#13643
Geosynthetics ID#13644
Hydraulic engineering ID#18524
Coast & Harbor engineering ID#18525
Flood control ID#18527
@Hydraulic cement (ID#43096)
@Hydraulic fluids (ID#43328)
@Hydraulic machinery (ID#41488)
Hydraulic structures ID#18528
Aqueduct engineering ID#37887
Dams, canals & waterways ID#18530
Reservoir engineering ID#18532
Irrigation & drainage ID#13032
Sediment transport ID#18535
Surface water runoff ID#18536
Lighthouses ID#18540
@Ocean engineering (ID#14472)
Structural engineering ID#18547
@Mechanical behavior of materials (ID#14474)
Structural concrete ID#18551
Structural steel ID#18554
Surveying ID#18555
@Geographic information systems (GIS) (ID#14475)
Photogrammetry ID#18560
@Remote sensing (ID#14476)
Transportation engineering ID#18565
@Automotive engineering (ID#41482)
Electric vehicles ID#36914
Emission control ID#18569
High-speed ground transportation ID#18570
Highways, roads & pavements ID#18522
Intelligent transportation systems ID#18574
Marine transportation ID#19488
Pipeline transportation ID#39656
Railroad engineering ID#18544
Transportation planning ID#18575
Transportation safety ID#39610
@National Transportation Safety Board (NTSB) (ID#40938)
@Urban transportation (ID#41339)
Tunnel engineering ID#18577
Construction industry ID#26320
Building contractor services ID#4725
Building materials ID#39453
@Carpentry & woodworking (ID#41390)
@Construction machinery (ID#41486)
@Driveway coating & construction (ID#42066)
@Electrician services (ID#41176)
@Fences & stone walls (ID#40582)
Hand & power tools ID#14756
@Home improvement centers (ID#43309)
@Insulation services (ID#41174)
@Landscaping services (ID#41177)
Nonresidential Construction ID#39628
Plumbers & plumbing supplies ID#14751
@Bathroom fixtures & accessories (ID#41169)
@Pool construction & maintenance services (ID#41179)
Residential construction ID#39449
@Roofing services (ID#41180)
@Septic systems (ID#42067)
@Underwater construction & Habitats (ID#43333)
@Water well drilling (ID#42047)
Facilities management ID#6270
Floor laying, refinishing & resurfacing ID#27352
Heating & Ventilation industry ID#26835
@Heating, ventilation & air conditioning (ID#41445)
House painting & wall covering services ID#14789
Industrial equipment & Heavy machinery industry ID#26341
@Farm equipment & Supplies industry (ID#41904)
@Manufacturing equipment & machinery (ID#41485)
@Turbomachinery (ID#41508)
Lighting industry ID#26349
Electrician services ID#14784
Electrical supplies ID#14744
Electrical testing & inspection ID#18674
Electrical wiring ID#18675
@Lighting & electrical systems (ID#41446)
@Lamps & light fixtures (ID#40577)
@Lighting design (ID#42054)
Laminated wood ID#27394
Particle board ID#27398
Plywood & veneer ID#5068
Pressure treated wood ID#27401
Sheet metal ID#5124
Wire & Cable products ID#39463
Aluminum & aluminum products ID#5112
Copper & copper products ID#5116
Iron industry ID#26357
Steel industry ID#26358
Stainless steel ID#5125
Paint & paint supplies ID#14748
Property developers ID#5232
Rock mechanics ID#39569
Soil science & technology ID#18312
Erosion ID#18313
Fertilizers ID#18316
Chemical fertilizers ID#17611
Organic fertilizer ID#17612
Soil cultivation ID#18315
Soil pollution ID#13029
Soil remediation ID#37889
Stone, clay, glass & concrete product industry ID#26373
Cement ID#5306
Hydraulic cement ID#36456
@Ceramics & Pottery (ID#40405) ID#797
Concrete ID#5307
Concrete block & brick ID#36460
Ready-mixed concrete ID#36461
Cut stone & stone products ID#5308
Granite ID#36462
Limestone ID#36463
Marble ID#36464
@Memorials & Grave stones (ID#42060) ID#27319
Earthenware ID#5309
Glass products ID#5311
Automobile glass ID#36478
Flat glass ID#36465
Glass containers ID#36466
@Mirrors (ID#40565) ID#4879
Pressed & blown glass ID#5315
@Sand & Gravel (ID#41514) ID#19469
Vitreous china ID#36475
Windows & doors ID#4998
Indexing
During indexing, all of the terms in the documents and metadata fields are extracted and indexed
into appropriate index structures; searches can be resolved by using these structures and without
having to refer to the original documents. This process is exhaustive and comprehensive; all
visible terms in the document display object and all appropriate metadata values are indexed.
No stop words or special characters are excluded from the index, and there is no practical limit
on document length beyond which indexing stops.
All search terms (including those inside quoted phrases) are viewed as nouns and transformed
(stemmed) automatically to their common singular form during indexing (and at query time). This
allows a search on a singular or plural noun to find occurrences of either. The stemming rules are
fairly simple and do not handle most irregular forms, which tend to occur for very common words
not generally useful in searching.
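The singular/plural normalization described above can be illustrated with a minimal sketch. The actual Northern Light stemming rules are not published in this paper; the function name singularize and the three suffix rules below are assumptions chosen only to show the idea of a simple rule set that leaves irregular forms alone.

```python
def singularize(term: str) -> str:
    """Reduce a regular English plural to a singular form.

    Illustrative rules only (an assumption, not the product's rule set).
    """
    word = term.lower()
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"      # "batteries" -> "battery"
    if len(word) > 3 and word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]            # "boxes" -> "box", "classes" -> "class"
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]            # "dams" -> "dam"
    return word                     # irregular forms pass through unchanged


# The same function is applied to document terms and to query terms,
# so "dam" and "dams" meet at the same index key.
print(singularize("dams"), singularize("dam"))
```

Because the same transformation runs at indexing time and at query time, even an imperfect rule still produces matching keys on both sides.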
All terms are indexed in lower case, and query terms are likewise translated to lower case for
initial query resolution, so a query finds all instances of a term regardless of case. In addition,
every term containing both upper- and lower-case letters is also indexed in a special case-sensitive
index, which allows a case-sensitive match with a query term to be used as a relevancy factor.
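As a rough illustration of the two indices, the sketch below keeps one lower-case posting set and one exact-case posting set per term. The class name DualCaseIndex and its methods are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict


class DualCaseIndex:
    """Hypothetical two-index layout: lower-case index plus case-sensitive index."""

    def __init__(self):
        self.lower = defaultdict(set)   # lower-cased term -> document ids
        self.cased = defaultdict(set)   # exact mixed-case term -> document ids

    def add(self, doc_id, text):
        for token in text.split():
            self.lower[token.lower()].add(doc_id)
            # Only terms with both upper- and lower-case letters get a cased entry.
            if any(c.isupper() for c in token) and any(c.islower() for c in token):
                self.cased[token].add(doc_id)

    def search(self, term):
        """Return {doc_id: case_matched} for every document containing the term."""
        hits = self.lower.get(term.lower(), set())
        exact = self.cased.get(term, set())
        return {doc_id: doc_id in exact for doc_id in hits}


index = DualCaseIndex()
index.add(1, "Apple pie")
index.add(2, "apple tart")
print(index.search("Apple"))
```

Both documents are retrieved for either spelling of the query; the boolean flag is the extra signal a ranking step could use.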
The above rules are English-language dependent. However, given the symmetry with which they
are applied at indexing and query time, they generally preserve appropriate search processing for
all languages. Double-byte language support, which requires the licensing of a third-party
language processing module from Teragram, includes language-sensitive stemming and other
sorts of processing.
Numeric tokens (or tokens of mixed letters and numbers) are indexed as text. Proximity
information can be represented in indices in various ways, allowing either very fast access of
short phrases, or complete and precise (but slower) access of phrases of unlimited length.
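The paper does not specify how proximity information is stored. One common representation, shown here purely as an assumed example, is a positional index: each term maps to the word positions at which it occurs, which supports exact phrases of any length at the cost of extra lookups. The function names build_positions and phrase_search are illustrative.

```python
from collections import defaultdict


def build_positions(docs):
    """Map term -> {doc_id: [word positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents in which the phrase words occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Every following word must sit at the next position in the same document.
            if all(start + i in index[t].get(doc_id, ()) for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {1: "flood control systems", 2: "control of flood water"}
print(phrase_search(build_positions(docs), "flood control"))
```

A faster structure for short phrases (for example, indexing adjacent word pairs directly) would trade index size for speed; that variant is not shown.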
Query Service
Finished databases are connected to Northern Light's network by the Query
Listener (QL). The QL accepts queries from external clients (such as a Web server) and passes
them to the Query Server (QS). The QS translates the search syntax and other parameters,
queries the database indices, and returns the appropriate citation information and metadata. The
Query Listener is also responsible for identifying itself by broadcasting, via UDP, a database
identifier and load information. Clients use this information to select a listener appropriate to their
mission.
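A sketch of the listener announcement and client selection follows. The wire format is not described in this paper, so the JSON payload, its field names (db, load, host, port) and the least-loaded selection policy are all assumptions made for illustration.

```python
import json
import socket


def encode_announcement(database_id: str, load: float, host: str, port: int) -> bytes:
    """Build an announcement datagram (hypothetical JSON format)."""
    message = {"db": database_id, "load": load, "host": host, "port": port}
    return json.dumps(message).encode("utf-8")


def broadcast(payload: bytes, port: int) -> None:
    """Send one announcement to the local broadcast address over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, ("255.255.255.255", port))


def select_listener(datagrams, database_id):
    """Pick the least-loaded listener serving the requested database."""
    candidates = [json.loads(d.decode("utf-8")) for d in datagrams]
    candidates = [c for c in candidates if c["db"] == database_id]
    if not candidates:
        return None
    best = min(candidates, key=lambda c: c["load"])
    return best["host"], best["port"]


received = [
    encode_announcement("news", 0.7, "10.0.0.1", 9000),
    encode_announcement("news", 0.2, "10.0.0.2", 9000),
]
print(select_listener(received, "news"))
```

A client would collect datagrams for a short interval and then call select_listener; the selection could equally weigh factors other than load.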
Searching
Nearly all search fields on all search forms accept and process searches in the same way. The
query interpretation algorithm proceeds as follows:
1. If the query is a well-formed Boolean expression, it is rigorously interpreted as such.
Boolean expressions can contain AND, OR, NOT, simple terms (words), quoted
phrases, wildcards and parentheses enclosing sub-expressions, and may be nested to
an unlimited depth. In addition, a Boolean expression may itself contain any
number of fielded sub-expressions that specify a search against a particular metadata
field, e.g., (lawsuit or sue) and title: microsoft or netscape. By default, terms are
searched against the text field, which includes all full text and all document metadata.
This field may also be specified by use of the text: keyword. Search terms may also
include one or more trailing or multi-character wildcards (indicated by *) or single-
character wildcards (%) as long as there are at least four non-wildcard characters before
the first wildcard, e.g., rachm%inof*. Search fields are available to the user through
appropriate fields on search forms or through the field: syntax. Other special fields
include relational operators that can be used for date fields, sort: date (a reverse
chronological sort), or sort: relevancy (the default).
2. If a well-formed Boolean expression is not found and the query is longer than a specified
length (currently 12 terms), a statistical query evaluation process is used. This requires
only a single query term to be present in a document for it to appear on the results
list, but makes use of all terms or phrases appearing in a document to determine
the best documents. This statistical evaluation can also be forced on a query of any
length by preceding the query with the pseudo-field like:, e.g., like: side effects of
anti-depressants and sedatives.
3. If neither of the above two conditions is met, then a query with any use of the + or -
operators common among Internet search engines will be interpreted according to
generally accepted rules. The rules are that any term or quoted phrase immediately
preceded by a + must be in a document to appear on the results list, and any term or
quoted phrase preceded by a - must be absent from a document for it to appear on the results
list. Other terms in the query are considered desirable but not required.
4. If the query does not meet any of the above criteria, a fuzzy search is performed. This
performs an implicit AND of most content-bearing words (or of words that are generally
non-content-bearing, if those are the only query terms) but uses all terms entered for relevancy
ranking purposes. Some limited natural language analysis is also performed on terms,
such as recognition of the word not.
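The four-step cascade above can be sketched as a small dispatcher. This is not the product's parser: the well-formedness test is a rough heuristic (balanced quotes and parentheses, at least one AND/OR/NOT, no dangling operator), fielded expressions without an operator are not recognized, and checking the like: prefix before the Boolean test is an assumption. The names is_boolean and classify_query are hypothetical.

```python
import re

OPERATORS = ("and", "or", "not")


def is_boolean(query: str) -> bool:
    """Heuristic well-formedness check; a real parser would build an expression tree."""
    tokens = re.findall(r'"[^"]*"|[()]|[^\s()"]+', query)
    if not tokens or query.count('"') % 2:
        return False
    depth = 0
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False
    if depth != 0:
        return False
    words = [t.lower() for t in tokens]
    if not any(w in OPERATORS for w in words):
        return False
    # A binary operator cannot open the query, and no operator can close it.
    return words[0] not in ("and", "or") and words[-1] not in OPERATORS


def classify_query(query: str, statistical_threshold: int = 12) -> str:
    """Return which of the four interpretation modes applies to a query."""
    q = query.strip()
    if q.lower().startswith("like:"):
        return "statistical"            # forced statistical evaluation
    if is_boolean(q):
        return "boolean"                # step 1
    if len(q.split()) > statistical_threshold:
        return "statistical"            # step 2
    if re.search(r'(^|\s)[+-]["\w]', q):
        return "plus_minus"             # step 3
    return "fuzzy"                      # step 4


print(classify_query("+dams -canals reservoirs"))
```

The order of the checks mirrors the numbered list: each later mode is reached only when the earlier tests fail.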
All query terms are presumed to be nouns and are translated, if necessary, to their singular form
using fairly simple algorithms. This allows a match against either a singular or plural form, since
all document terms are similarly converted to singular form during indexing. Query terms are also
translated to lower case in order to be able to match any form of the word in any document; all
document terms are similarly converted to lower-case at indexing time. Mixed case terms are
searched against a special mixed case index to provide information about case-sensitive matches
for relevancy ranking.
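The wildcard rule from step 1 of the query interpretation algorithm, applied to lower-cased terms as described above, might be expressed as follows. The sketch assumes that * matches zero or more characters and % matches exactly one, and it expands a pattern against a plain set of vocabulary terms; the names wildcard_to_regex and expand are illustrative.

```python
import re

MIN_PREFIX = 4  # non-wildcard characters required before the first wildcard


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a query term with * and % wildcards into a compiled regex."""
    positions = [i for i, c in enumerate(term) if c in "*%"]
    if positions and positions[0] < MIN_PREFIX:
        raise ValueError("at least four characters must precede the first wildcard")
    parts = []
    for c in term.lower():
        if c == "*":
            parts.append(".*")      # multi-character wildcard (assumed: zero or more)
        elif c == "%":
            parts.append(".")       # single-character wildcard
        else:
            parts.append(re.escape(c))
    return re.compile("".join(parts))


def expand(term, vocabulary):
    """Return the sorted vocabulary terms matched by a wildcard term."""
    pattern = wildcard_to_regex(term)
    return sorted(w for w in vocabulary if pattern.fullmatch(w))


vocabulary = {"organization", "organisation", "organizations", "organ"}
print(expand("organi%ation*", vocabulary))
```

The minimum-prefix requirement keeps the candidate set small, since only terms sharing the leading characters need to be examined.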
Relevancy Ranking
One of the strengths of Northern Light's technology is its advanced relevancy ranking algorithms.
These not only provide a novel approach to ranking but are based on highly optimized index
structures and algorithms that allow Northern Light Search and Content Integration to perform
significant relevancy ranking operations on a very large database.
Ranking takes into account several different factors, each of which contributes weight to a
document's overall relevancy score and to its eventual placement in the results list. A maximum
theoretical relevancy score is calculated for every query, and displayed relevancy scores
represent a simple transformation to a 1-99% range of the actual document score as compared to
the maximum theoretical score. The ranking factors include the following:
Number of occurrences of matching terms (term frequency factor, or TF)
Relative frequency of those terms in the entire database (term inverse document frequency, or IDF)
Implicit phrase recognition
Location of matching terms and phrases in document metadata
Number and authority of external sites linking to this document (applies to Web documents only)
Date of the document (all other things being equal, more recent documents are considered to be more relevant than older documents)
Classification metadata
Document length
Presence of any detectable spam
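Only the first two factors in the list, TF and IDF, lend themselves to a generic formula, so the sketch below models just those and then maps a raw score onto the 1-99% display range. The smoothed IDF formula, the linear mapping and the function names are assumptions; the paper does not state how the maximum theoretical score is derived, so it is passed in as a parameter.

```python
import math


def idf(term, doc_freq, total_docs):
    """Smoothed inverse document frequency: rarer terms weigh more."""
    return math.log((total_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0


def raw_score(query_terms, doc_terms, doc_freq, total_docs):
    """Sum of term frequency times IDF over the query terms."""
    return sum(doc_terms.count(t) * idf(t, doc_freq, total_docs) for t in query_terms)


def displayed_score(score, max_score):
    """Map a raw score onto the 1-99 range relative to the theoretical maximum."""
    if max_score <= 0:
        return 1
    ratio = min(max(score / max_score, 0.0), 1.0)
    return 1 + round(98 * ratio)


doc_freq = {"dam": 1, "water": 50}      # documents containing each term
score = raw_score(["dam", "water"], ["dam", "water", "water"], doc_freq, 100)
print(displayed_score(score, max_score=20.0))
```

The remaining factors (phrase recognition, metadata location, link authority, date, classification, length and spam) would contribute additional weighted terms to raw_score.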
Custom Search Folders
For any result set containing more than 25 results, Northern Light determines a set of Custom
Search Folders (CSFs) before returning the results to the user. To do this, Northern Light
examines the metadata values of the documents on the results list, uses those values to
determine appropriate CSFs, weighs each of the CSFs and then displays the top-weighted CSFs.
Weighting of CSFs is determined by a number of rules that contribute different values to the
overall weighting of that CSF. Certain rules assign weights based on how many documents are in
the CSF, or how many of its documents rank high on the results list. Other rules assign values
based on how different the CSFs are from other candidate CSFs, or based on the more exact
nature of the metadata values involved.
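A minimal sketch of folder weighting follows, covering only the two rules the text names explicitly: how many documents fall in a folder, and how many of them rank high. The numeric weights, the top_ranked cutoff and the function name custom_search_folders are assumptions; the distinctiveness and metadata-precision rules are omitted.

```python
from collections import defaultdict


def custom_search_folders(results, top_ranked=10, max_folders=3, min_results=26):
    """Choose folder names from ranked results.

    results: list of metadata-value lists, one per document, in rank order.
    """
    if len(results) < min_results:
        return []                       # folders apply only to more than 25 results
    weights = defaultdict(float)
    for rank, values in enumerate(results):
        for value in set(values):
            weights[value] += 1.0       # rule: number of documents in the folder
            if rank < top_ranked:
                weights[value] += 0.5   # rule: bonus for highly ranked documents
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:max_folders]]


results = [["Dams"]] * 10 + [["Bridges"]] * 20
print(custom_search_folders(results))
```

Each candidate folder corresponds to one metadata value; the top-weighted names are what the user would see alongside the results list.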
Session Management
Northern Light's service uses a proprietary session state management system to
store user state between page requests or other transactions. Each user is given a cookie with a
unique token that contains no outwardly useful information. The user's browser transmits the
token in the headers of each request, and the Northern Light software uses the token to retrieve
session data.
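The session system itself is proprietary, so the following is only a generic sketch of the opaque-token pattern the paragraph describes: the cookie carries a random identifier, and all state stays on the server. The class SessionStore, the cookie name session and the in-memory dictionary are assumptions.

```python
import secrets


class SessionStore:
    """Server-side session state keyed by an opaque random token."""

    def __init__(self):
        self._sessions = {}

    def create(self) -> str:
        # The token is random, so it reveals nothing about the user or the state.
        token = secrets.token_urlsafe(32)
        self._sessions[token] = {}
        return token

    def get(self, token):
        """Return the session dict for a token, or None if the token is unknown."""
        return self._sessions.get(token)


def set_cookie_header(token: str) -> str:
    """Value of a Set-Cookie header carrying the token (hypothetical cookie name)."""
    return f"session={token}; Path=/; HttpOnly"


store = SessionStore()
token = store.create()
store.get(token)["last_query"] = "flood control"
print(store.get(token))
```

On each request the server reads the token from the Cookie header and calls get; an unknown token simply yields no session.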
Security
Northern Light can invoke a variety of security solutions at the option of the customer. If a
username/password scheme is desired, we provide an administrative user interface for managing the
passwords. IP validation can be used in lieu of username/password security, or in addition to it.
Secure https protocols are customarily used with Verisign certificates ensuring the validity of
connections, or leased T1 lines can be used for extreme security. In our 7-year history, Northern
Light has never experienced a single hacker intrusion into customer applications or our network.
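The combination of IP validation and password checks can be illustrated as follows. The mode names ("ip", "password", "both") and the function names are hypothetical, and the address ranges in the example are documentation-only networks rather than real customer data.

```python
import ipaddress


def ip_allowed(client_ip: str, allowed_networks) -> bool:
    """True if the client address falls inside any permitted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)


def authorize(ip_ok: bool, password_ok: bool, mode: str) -> bool:
    """Combine the two checks according to the customer's chosen mode."""
    if mode == "ip":
        return ip_ok                    # IP validation in lieu of passwords
    if mode == "password":
        return password_ok
    if mode == "both":
        return ip_ok and password_ok    # IP validation in addition to passwords
    raise ValueError(f"unknown mode: {mode}")


customer_networks = ["192.0.2.0/24"]
print(authorize(ip_allowed("192.0.2.17", customer_networks), False, "ip"))
```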
Alerts
Northern Light offers users and enterprise customers the ability to save any search and have it
run automatically whenever the database referenced by the search is updated. At that time, an e-
mail is sent to the registered owner of the alert if and only if there are new documents in the
database that meet the search criteria; the e-mail message contains a link to just these new
results. The system further keeps track of when a user has actually accessed these new results
so that, if a user receives a string of alert e-mails before being able to see any of them, accessing
any one of them will provide all the new results since the user's last access; it is unnecessary to
cycle through the alert messages one at a time in order to see all new results.
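The alert behavior described above amounts to a small state machine: send an e-mail only when an update produces new matches, and let any alert link deliver everything accumulated since the last access. The class Alert, its method names and the use of an update sequence number are assumptions made for this sketch.

```python
class Alert:
    """Saved search that accumulates new matches until the user views them."""

    def __init__(self):
        self.last_access = 0    # sequence number of the last update the user viewed
        self.pending = []       # (update_seq, new_doc_ids) batches not yet viewed

    def on_database_update(self, update_seq, matching_new_docs):
        """Record new matches; return True if an e-mail should be sent."""
        if not matching_new_docs:
            return False        # no new documents, so no e-mail
        self.pending.append((update_seq, list(matching_new_docs)))
        return True

    def open_link(self):
        """Following any alert link returns all new results since the last access."""
        docs = [doc for _, batch in self.pending for doc in batch]
        if self.pending:
            self.last_access = self.pending[-1][0]
        self.pending = []
        return docs


alert = Alert()
alert.on_database_update(1, ["doc-a"])
alert.on_database_update(2, [])
alert.on_database_update(3, ["doc-b", "doc-c"])
print(alert.open_link())
```

After one access the pending list is empty, so following an older alert e-mail afterwards shows nothing further.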
Applications Development and Hosting
Northern Light offers complete services for producing, or letting customers produce, custom
search applications to be run either in ASP mode, in-house at a customer site, or some
combination of the two. These can include documented APIs for searching, customized results
lists, alerts and other capabilities (usually through XML interfaces). Northern Light also has a
dedicated applications group, efficient tools for rapid development of user interfaces and, behind
it all, a 7x24 secure operations facility.
Northern Light Technology Awards
Top 100, eContent magazine, December 2001
Best of the Web, US News and World Report, October 2001
Top 100 Companies That Matter, KMWorld magazine, September 2001
Editors' Choice, PC Magazine, November 2000
Best of the Web, Forbes magazine, September 2000
Web Business Award For Online Excellence, CIO magazine, July 2000
Best Online Business/Professional Service, Software & Information Industry Association, March 2000
Best Online Research Product, Software & Information Industry Association, March 2000
Best Online Information Service, Software & Information Industry Association, March 2000
Editors' Choice, PC Magazine, September 1999, 1998, 1997