sp_whitepaperof northern light

  • 7/29/2019 SP_WhitePaperof Northern Light

    Northern Light SinglePoint Market Research Portal

    Overview White Paper

    September, 2003

    One Broadway, 14th Floor, Cambridge MA 02142

    617-242-5960

    Copyright 2003, Northern Light Group, LLC, All Rights Reserved

    Table of Contents

    Background
    SinglePoint Market Research Portal Overview
    Custom Content Integration
    Search Technology Architecture History
    Search Technology Architecture Overview
        Documents
        Metadata and Metatags
        Data Model
        Data Collection and Web Crawling
        Search Service Description
        Query Database
        Automatic Classification
        Northern Light Taxonomy
        Sample Taxonomy: Aerospace
        Sample Taxonomy: Construction
        Indexing
        Query Service
        Searching
        Relevancy Ranking
        Custom Search Folders
        Session Management
        Security
        Alerts
    Applications Development and Hosting
    Northern Light Technology Awards

    Background

    Northern Light was founded in 1996 with the objectives to: (i) unify all of the best content in the

    world into one database, (ii) build search technology that allows searchers to easily find the most relevant (not just the most) information within that database, (iii) build search technology that

    works for both the first time untrained user and the information professional, and (iv) create a set

    of tools to allow Northern Light to build and operate custom information solutions for businesses

    that utilize these capabilities. This mission was formed from the following observations:

    - Most interesting questions have relevant content from many sources, i.e., the Web, news
      feeds, licensed research, journal archives, and internal corporate information.

    - Given the penetration of Internet technology on corporate networks, there is no longer any
      technical or distribution barrier to making all digital information available from any desktop
      computer.

    - Unstructured information is overwhelmingly the most common type. It is impossible in any
      large search application to know what the organization, metatags, or document structure will
      be in advance.

    - People in all walks of life are search engine literate and use search engines to access
      information on a daily basis. There is no other corporate application that requires less
      training or support than a search application.

    In particular, the key problem for search engine information retrieval is that of producing a

    precise set of relevant documents: fewer good documents, not more useless ones.

    To accomplish its mission, Northern Light set out to meet the following goals:

    - Build a continually growing stable of content integration technology that would allow Northern
      Light to create databases of greatly diverse content,

    - Make use of the best existing technology and develop new technology for highly scalable and
      precise searching,

    - Develop highly scalable automated classification and related technologies to use pre- and
      post-search, because even the best query interpretation and relevancy ranking are frequently
      inadequate to answer an information need expressed in one or a few words against a
      database of a billion documents or more, and

    - Become extraordinarily proficient at dealing with unstructured data, bringing accessibility,
      usability, organization, and classification to arbitrarily large and diverse content sets from
      thousands of sources.

    NorthernLight.com formally launched in August 1997, and was the first Internet search engine to

    offer access to Web, published, and internal corporate content in a single database.

    SinglePoint Market Research Portal Overview

    Most large companies license studies, analysis, and commentary from dozens of market research

    firms, trade journals, periodicals, equity analyst investment reports, and newswires. When there

    is the need to learn about a market or a product or a competitor, searching all the licensed

    sources is a major problem. How does the researcher, who may be a marketing, sales, or

    product development professional rather than a librarian, search dozens of market research and

    other sources? Does he or she log in dozens of times with dozens of different user names and

    passwords? Learn dozens of user interfaces on dozens of different Web sites? Run dozens of

    searches? Manually collate dozens of results lists?

    Another challenge confronting organizations that license content from many sources is the

    utilization of that content. Most of us are very impatient with search processes as a result of use

    of Internet search engines like Google. As end-users, we know that it is technically possible to

    have all the relevant information in a single database. Individuals will simply not persevere in

    searching several sources. The most popular and obvious one will be searched, then maybe one

    more; on rare occasions three sources might be searched. The fourth through the hundredth

    licensed services are rarely consulted.

    Northern Light SinglePoint offers a more efficient and effective way to use all of a company's

    licensed content with one login, one search, and one integrated results list from all licensed sources.

    Companies can even include search of internal market research and studies so users can access

    relevant material produced within the organization while simultaneously looking at external

    sources. Using SinglePoint, it is also possible to include Web content in the database. (Is any

    search today complete without consulting relevant Web sources?) For example, vertical searches

    of competitors' websites, trade association websites, and government regulatory agency websites

    are highly relevant to many marketing and product planning decisions.

    The outcome: complete research from many sources, in the same time it now takes to search just

    one source. How much is the time of every professional employee in an organization worth?

    Also, the utilization of licensed content soars when a SinglePoint application is deployed.

    Sources are used based on their relevance, not on how familiar the staff is with them.

    The salient features of SinglePoint include:

    - Content: All licensed, internal, and Internet content from any source, in any format, from any
      organization anywhere in the world. Whatever the sources are, we can put them into one
      integrated index.

    - Organization: All sources indexed and subject classified to a consistent standard.

    - User interface:

        - One login

        - One user interface

        - One search

        - One results list

    - Seat management: Enforce access privileges by user or group to sources, groups of
      sources, or individual documents.

    - Outsourced turnkey solution: Northern Light develops, hosts, and maintains the
      SinglePoint application to help keep overhead down and minimize impact on your internal
      IT resources.

    - Security: Private database and private Web server via secure API, VPN, or T1. User names
      and passwords can be used, as well as IP validation. Security is customized to meet specific
      client requirements, and integration with corporate network security systems is available.

    Custom Content Integration

    The heart of a SinglePoint application is custom content integration. It is not unusual to integrate

    dozens, even scores, of sources into a SinglePoint market research portal.

    The integration normally involves these steps:

    - Determine the licensing arrangements of our customer with each of the sources. Northern
      Light has to understand whether the content is available on an enterprise-wide flat rate, is
      limited to a certain number of seats, or is limited in some other way. Most research vendors
      offer a wide variety of options in their offerings, and Northern Light has to understand which
      options have been selected by our customer.

    - Work with the market research vendor to understand the structure of the vendor's content
      and network, and to determine how to acquire it to build our customer's index. Vendors today
      most often use an http-based extranet that we can crawl (with the vendor's help getting
      through the firewall). Alternatively, FTP file transfers are a common technique for acquiring
      the content. Northern Light will then set up automated processes to acquire exactly what our
      customer has licensed from each vendor.

    - Write filters to convert the vendor's content to the Northern Light load format. This involves
      determining how to capture the vendor's metadata (or metatags) so that the metadata can be
      included in the index (and hence available to select, sort, and filter results).

    - Determine the login required for the vendor to fulfill document requests, and set up our
      transaction system to be able to transmit documents from the vendor site to the browsers of
      the end-users of our customer.

    - Determine if any internal content is intended for the database. This can be internal market
      studies, MS PowerPoint presentations, or locally held copies of licensed market research. If
      internal content is to be included, crawling or FTP file transfer procedures must be arranged.
      For small volumes of internal content, Northern Light's Automated Submission and Publishing
      (ASAP) system can be used to easily publish documents to the database.

    - Determine if any Web content is desired for the database. Popular choices include
      competitors' websites, trade association websites, and government regulatory agency
      websites. Subscription websites and e-journals can be included in the database, e.g., trade
      journal sites or industry publication sites.

    - Index and classify the content, creating the comprehensive multi-vendor index for our
      customer to search.

    - Load the content every day, create the index, and serve the end-user queries. Northern Light
      disposes of our copy of the vendor's content (or of internal content) a few days after the load
      process is complete. (We do maintain a copy for a few days so we can re-index a recent
      load if there turns out to be a problem of any sort.) Note that a SinglePoint database index is
      unique to a specific customer, facilitating security and usability.

    Once the comprehensive multi-vendor/multi-source database index has been built, end-users

    may query from UIs provided on their intranet or from a UI we can provide as a private website.

    Results returned are from all the content in the database across all of the licensed vendors,

    consistently relevance ranked and indexed. When an end-user wants a document, he or she

    clicks on the link just like any other search engine. We then instantly, transparently, and

    automatically authenticate that the end-user has rights to the document, log the end-user in to

    the service of the vendor in question, fetch the document, and put the document in the browser

    window of the end-user. The transaction system of the vendor records the event as if the end-

    user had logged in.
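
    The fulfillment sequence described above (check the end-user's rights, log in to the vendor's
    service, fetch the document, deliver it) can be sketched roughly as follows. This is only an
    illustration under stated assumptions: the names VendorSession, ENTITLEMENTS, and fulfill, the
    entitlement table, and the stand-in vendor session are hypothetical and do not describe the
    actual Northern Light transaction system.

```python
from dataclasses import dataclass, field


@dataclass
class VendorSession:
    """Hypothetical stand-in for a logged-in session at a vendor's service."""
    vendor: str
    log: list = field(default_factory=list)

    def fetch(self, url: str) -> str:
        # The vendor's own transaction system records the access.
        self.log.append(url)
        return f"<html>document at {url}</html>"


# Hypothetical entitlement table: user -> set of licensed vendors.
ENTITLEMENTS = {"alice": {"VendorA", "VendorB"}, "bob": {"VendorA"}}


def fulfill(user: str, vendor: str, url: str, sessions: dict) -> str:
    """Return the document body if the user is entitled to it, else raise."""
    if vendor not in ENTITLEMENTS.get(user, set()):
        raise PermissionError(f"{user} has no rights to {vendor} content")
    # Log in on the user's behalf (one shared session per vendor in this sketch).
    session = sessions.setdefault(vendor, VendorSession(vendor))
    return session.fetch(url)
```

    In this sketch the rights check happens before any vendor login, so an end-user without a
    license never generates a transaction on the vendor side.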

    Below is a diagram of the content integration process.

    [Figure: content integration process. Labels recoverable from the diagram: Northern Light; Customer; User Interface; Integrated Index; Crawler; Trash; Internal Repository; Internal Reports; Mkt. Research; Investment reports; Industry Websites; Newswires & journals; Full-text content; Computer generated data.]

    Search Technology Architecture History

    Northern Light's initial strategy was to provide a service to site visitors that would generate

    revenue from advertising and from sales of a licensed business information library known as the

    Special Collection, with over 7,000 journal and periodical sources. In order to offer this service,

    Northern Light had to master the Web and hundreds of publisher formats, and integrate all of the

    content into one database, normalize it from an indexing and classification viewpoint, and search

    it all with one query. The total database approached 400 million documents and it still may be the

    largest business research library ever assembled.

    The service (the search engine then available at www.NorthernLight.com) was warmly received

    by existing consumers of high-quality, fee-based information: professionals in corporations,

    educational institutions and governments. As a result, a marketing effort was directed toward the

    enterprise market through high volume sales to organizations.

    Northern Light's unique focus on both Web and non-Web data, coupled with its ability to sell

    non-Web documents, made it of interest to organizations that wanted to be directly involved in offering

    their own data or services. As a result, the initial strategy was combined with a gateway

    partnership component that typically paired Northern Light with other organizations in the offering

    of co-branded, specialized, Internet-based search sites.

    Northern Light's enterprise strategy now includes a suite of customized search solutions based

    on the Northern Light technology platform that creates value for a large range of organizations

    and situations. These solutions include:

    - Custom intranet portals that feature search of customized content sets of licensed, internal,
      and Web sources,

    - Search-based services for extranets,

    - Hosting and sale of archival documents for publishers,

    - Search-of-site for Web-based services, and

    - Full-scale custom information products (e.g., for the U.S. government), etc.

    These information management products reflect the company's core competencies in search

    technology, classification and taxonomy development, and integration of diverse content and

    federated search. Northern Light typically offers these services on an outsourced, ASP basis.

    However, the Northern Light search technology is also available as licensed software for in-house

    customer use on Solaris and Linux platforms.

    All of Northern Light's solutions derive from the original vision expressed in our Web search

    engine of one database that could provide access to all the world's useful information. All of our

    solutions share these characteristics:

    - Scale to gigantic numbers of documents.

    - Speed to efficiently search such large databases.

    - Precision, or relevance ranking, to make the large databases useful.

    - Classification to automatically organize the body of unstructured content in useful ways.

    Search Technology Architecture Overview

    Documents

    The basic unit of the Northern Light database is a document. Documents are most commonly

    text-based, even though they appear to the database just as objects with a URL, so they could be any

    media. Each document is viewed as a multi-dimensional object that may have one or more

    values for a number of different fields, attributes, dimensions and/or domains (terms used

    interchangeably).

    One such special field is the display object: the viewable document itself, generally stored in the

    form it was delivered to Northern Light. The display object is generally not retained by Northern

    Light unless there is an arrangement to do so. Most often, the display object resides where it was

    crawled, and the URL in the Northern Light database points to it there.

    Metadata and Metatags

    Other fields, or values for all other fields (title, author, creation date, subject, source, etc.), are

    generally referred to as metadata: data about the document. The term metadata sometimes

    applies to the actual values a document may have for each of these fields, as well as to the

    named fields themselves. Certain metadata is required for any document, including a display

    object, title, etc. Metadata is also used for Custom Search Folders and other classification-based

    browsing and searching, and includes key attributes such as subject, type, source, language and

    region.

    This metadata is sometimes present in the document itself (in which case the metadata might be

    called metatags) and sometimes it is generated by the Northern Light technology's

    auto-classification capabilities. Unique metadata of custom data types, e.g., internal documents, may

    be captured by the content loading filter so that they will always be present for use in

    classification and searching.

    A document can have multiple values for a single field, e.g., multiple subjects or multiple authors.

    All metadata is represented in the database index, which makes it available for filtering,

    organizing, and sorting at query time.
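
    As a minimal sketch of this document model, a document can be represented as a URL plus a set
    of multi-valued metadata fields that are available for filtering and sorting. The Document class,
    the filter_by helper, the field names, and the sample records are all hypothetical and chosen
    only for illustration; they are not the Northern Light data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document as a multi-dimensional object: a URL plus multi-valued fields."""
    url: str
    title: str
    metadata: dict = field(default_factory=dict)  # field name -> list of values

    def values(self, name: str) -> list:
        # A missing field simply has no values.
        return self.metadata.get(name, [])


def filter_by(docs, name, value):
    """Keep documents that carry `value` among their values for field `name`."""
    return [d for d in docs if value in d.values(name)]


# Made-up sample records.
docs = [
    Document("http://example.com/a", "Jet engine outlook",
             {"subject": ["Jet engines", "Aerospace materials"],
              "author": ["A. Smith", "B. Jones"], "date": ["2003-05-01"]}),
    Document("http://example.com/b", "Bridge design review",
             {"subject": ["Bridge engineering"], "author": ["C. Lee"],
              "date": ["2003-08-15"]}),
]

jet = filter_by(docs, "subject", "Jet engines")
newest_first = sorted(docs, key=lambda d: d.values("date")[0], reverse=True)
```

    Because every field holds a list, a document with two subjects or two authors needs no special
    handling at query time.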

    Data Model

    The standard metadata used for Custom Search Folders (subject, type, source, language and

    region) is treated specially in a number of ways. The possible values for each of these fields

    have been defined and comprise a taxonomy or set of possible values for that field/domain.

    These taxonomies are all hierarchical, though they contain many cross-references; a given value

    may have more than one parent because each taxonomy is actually a directed acyclic graph.

    The subject taxonomy contains approximately 17,000 values (referred to as nodes), starting at

    the top level with broad categories such as "humanities," and sometimes going eight or more

    levels deep in certain areas to provide very specific subject values such as "works of W.H.

    Auden" or "robotics."
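
    The point that a taxonomy value may have more than one parent can be sketched with a small
    child-to-parents table. The node names are taken from the sample aerospace taxonomy in this
    paper, but the parent links shown here are assumed for illustration (the indentation of the
    sample listing is not preserved), and the ancestors function is a hypothetical helper.

```python
# Child -> parents. A node may have several parents, so the taxonomy is a
# directed acyclic graph rather than a strict tree. Links are illustrative.
PARENTS = {
    "Aviation & space technology": [],
    "Aeronautics": ["Aviation & space technology"],
    "Civil aviation": ["Aviation & space technology"],
    "Flight control & navigation": ["Aeronautics", "Civil aviation"],
}


def ancestors(node: str, parents: dict = PARENTS) -> set:
    """Collect every node reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen
```

    Here a document classified to "Flight control & navigation" would be reachable by browsing down
    from either of its two assumed parents.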

    The type field refers to the kind of document: an article (the default and most populated type),

    a review (with more specific typing of book review and others), an editorial, a letter, a report,

    something for sale, etc.

    The source field refers to where the document came from, and is either a Web source of some

    kind (e.g., a Web site, or possibly a higher-level source node such as "all commercial sites") or a

    Special Collection source, typically a single journal or book title at the lowest level (e.g., The

    Economist or the Boston Herald) or, again, a higher-level aggregate (e.g., journals and

    magazines, news articles, etc.).

    The language field is the predominant language(s) of the document, currently one of English,

    French, German, Spanish, Italian and unknown (i.e., some other language).

    The region field specifies a location or locations referenced in the document: a city, country,

    geographic region, etc.

    For type, language and subject, the metadata value(s) attempt to capture what the document is

    really about (or substantially written in, in the case of language). Multiple values are possible but

    these are intended to represent true multi-subject documents. In the case of region, the model is

    slightly different; a document will be tagged with any and all regions that can possibly be

    identified with a document. The difference between region and these other fields is the way they

    are used in searching.

    This multi-dimensional model is in contrast to a single dimensional model that must rely on

    repetition within a single domain in order to achieve comprehensive document descriptions. For

    example, in a single-dimensional design, values like "reviews" or "biographies" could be repeated

    under all or a very large number of subject values. As the amount of metadata increases, the

    single (or few) domain model becomes increasingly complex and unwieldy. The Northern Light

    multi-dimensional model, however, can maintain multiple taxonomies easily and simply and can

    class and organize documents against them.
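
    A rough way to see the difference: if every type value had to be repeated under every subject
    value, a single-domain taxonomy would grow with the product of the two sizes, while separate
    taxonomies grow with their sum. The tiny taxonomies below are made up for illustration only.

```python
from itertools import product

# Made-up miniature taxonomies.
subjects = ["Jet engines", "Bridge engineering", "Robotics"]
types = ["article", "review", "editorial"]

# Single-domain model: one node per (subject, type) combination.
single_domain_nodes = [f"{s} / {t}" for s, t in product(subjects, types)]

# Multi-dimensional model: each taxonomy is kept separately, and a document
# is tagged with one or more values from each.
multi_dimensional_nodes = subjects + types

print(len(single_domain_nodes), len(multi_dimensional_nodes))  # 9 6
```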

    Data Collection and Web Crawling

    Using SinglePoint, any content in any format located on any computer anywhere in the world in

    the possession of any organization can be put into the database. From a technical viewpoint,

    data flows into the database via crawling (if the content is on an http platform such as the Web or

    an intranet), licensed feeds, or by FTP file transfer.

    Data is converted to a standard Northern Light format that captures the document itself plus

    associated metadata, including title, date, and anything else that the customer wants to have

    captured, such as document type. Since data typically arrives in non-HTML format, part of this

    conversion involves changing the document text itself (often in tagged ASCII, SGML or other

    formats) into HTML. Northern Light has, to date, converted over 200 different data formats,

    including the older tif-wrapped PDF and documents rendered as images. (Images are

    processed with programmatic OCR to make the content available for indexing and full text

    search.)
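
    A conversion filter of the kind described above might look roughly like the following sketch,
    which turns a tagged-ASCII record into metadata plus an HTML rendering. The two-letter tags, the
    TAG_TO_FIELD mapping, the convert function, and the sample record are hypothetical; the actual
    Northern Light load format is not described in this paper.

```python
import html

# Hypothetical mapping from a vendor's two-letter tags to metadata field names.
TAG_TO_FIELD = {"TI": "title", "DT": "date", "TY": "type"}


def convert(tagged_text: str) -> dict:
    """Split a tagged-ASCII record into metadata and an HTML body."""
    metadata, paragraphs = {}, []
    for line in tagged_text.splitlines():
        tag, sep, value = line.partition(":")
        value = value.strip()
        if sep and tag in TAG_TO_FIELD:
            metadata[TAG_TO_FIELD[tag]] = value
        elif sep and tag == "TX":
            paragraphs.append(value)
        # Lines with an unknown tag or no tag are ignored in this sketch.
    body = "".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    title = html.escape(metadata.get("title", ""))
    page = f"<html><head><title>{title}</title></head><body>{body}</body></html>"
    return {"metadata": metadata, "html": page}


record = (
    "TI: Storage market forecast\n"
    "DT: 2003-09-01\n"
    "TY: report\n"
    "TX: Demand grew & prices fell."
)
result = convert(record)
```

    The captured metadata travels with the document into the index, while the HTML rendering is
    what gets indexed for full-text search.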

    In the case of certain third-party content licensed specifically for one or more SinglePoint

    implementations (such as market research content from vendors such as Gartner, IDC or

    Forrester), Northern Light keeps the content only as long as it takes to create the necessary

    indexes. Once the content has been completely indexed, it is discarded, making it impossible for

    Northern Light to re-create the full-text of these content sources.

    Search Service Description

    Query Database

    Once data is placed in a standard Northern Light format, it is loaded into a Northern Light

    database. "Loading" may be a misleading term because the content does not actually reside in the

    Northern Light database. Loading refers to subject classifying the documents and indexing

    them. For SinglePoint applications, the index of classified documents is unique to an individual

    corporate customer. End-users send queries to the index and the results lists are generated

    from the index. The content itself is not touched during querying; indeed, for many applications

    Northern Light actually disposes of its copy of the content after it is loaded, remembering in the

    index, of course, where we got it from so that the document can be retrieved if an end-user

    desires to read it after selecting it from a results list.
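
    The idea that the index answers queries while the content itself stays at its source can be
    sketched with a small inverted index that stores only terms and source locations. The
    PointerIndex class and its methods are hypothetical, and whitespace tokenization is a
    simplifying assumption; this is not a description of the Northern Light index format.

```python
from collections import defaultdict


class PointerIndex:
    """Inverted index that keeps terms and source URLs, not document text."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids
        self.locations = {}                # document id -> source URL

    def load(self, doc_id: str, url: str, text: str) -> None:
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.locations[doc_id] = url
        # `text` is not stored; the caller is free to discard its copy.

    def query(self, *terms: str) -> list:
        """Return URLs of documents containing every query term."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        hits = set.intersection(*sets) if sets else set()
        return sorted(self.locations[d] for d in hits)
```

    A results list built this way carries only the location of each document, which is what allows
    the document to be fetched from the vendor when the end-user selects it.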

    Automatic Classification

    To deliver automated classification against a huge and heterogeneous data set, the Northern

    Light technology uses our own classification taxonomies for subject, type (e.g., article, review,

    FAQ, job listing, etc.) and other document attributes, drawing on existing taxonomies and

    supplementing them to provide comprehensive coverage for a wide range of users. An automatic

    system has also been built that uses multiple strategies (e.g., pattern extraction from training

    documents, co-location analysis, and structural elements) for classifying documents for a given

    attribute. Both the taxonomies and the automated system have been in production and supporting

    end users since August, 1997, have classified over a billion documents, and are continually being

    refined to deliver more comprehensive and precise classification and better operational

    performance.

    At this point, Northern Light's automatic classification is still the only system to ever automatically

    subject classify the World Wide Web. Performance levels have been achieved by fully divorcing

    the logical classification models from their practical implementation and creating data structures

    appropriate for rapid classification of documents against the large but relatively stable

    taxonomies, patterns, and rules that are the basis of the classification process.

    One primary use of classification information (i.e., metadata) at Northern Light Integration today is

    to organize the results (through Custom Search Folders) of a search by appropriate attribute values. This facilitates rapid navigation and some level of automatic query refinement, while

    allowing more expert users to limit their search initially by some appropriate attribute value.

    Metadata is also used as one factor (among many) in relevancy ranking.
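
    Organizing a results list by attribute value can be illustrated with a simple grouping step. The
    folders function and the sample results below are hypothetical and only sketch the general idea;
    how Custom Search Folders are actually selected and named is not described here.

```python
from collections import defaultdict


def folders(results: list, attribute: str) -> dict:
    """Group result titles by each value they carry for `attribute`."""
    grouped = defaultdict(list)
    for doc in results:
        for value in doc.get(attribute, []):
            grouped[value].append(doc["title"])
    return dict(grouped)


# Made-up search results with multi-valued metadata.
results = [
    {"title": "Jet engine outlook", "subject": ["Jet engines"], "type": ["report"]},
    {"title": "Turbine review", "subject": ["Jet engines", "Aerospace materials"],
     "type": ["review"]},
]
by_subject = folders(results, "subject")
by_type = folders(results, "type")
```

    A result with more than one value for the attribute appears in more than one folder, and
    choosing a folder acts as a refinement of the original query.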

    Northern Light Taxonomy

    Subject classification has been designed to classify a document to the one or a small number of

    subjects from our 17,000+-term subject taxonomy that a document is truly about (vs. classifying

    to all subjects that occur in the document). The system can today subject classify approximately

    25% of random Web documents (about what human editors are able to do) at accuracy rates of

    90-95% based on user/customer appraisals. These coverage and accuracy rates are

    significantly better for non-Web documents. Classification coverage and accuracy have been

    realized by continually engineering and extending both known and novel technologies in light of

    specifically identified problems.
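
    The distinction between the subjects a document is truly about and the subjects that merely
    occur in it can be illustrated with a toy scorer. This is not the Northern Light classifier: the
    term lists, the scoring rule (each subject's share of all matched terms), the threshold, and
    the names PATTERNS and classify are assumptions made purely for illustration.

```python
# Hypothetical term patterns per subject; a real system would derive these
# from training documents and other evidence.
PATTERNS = {
    "Jet engines": {"turbofan", "compressor", "thrust"},
    "Air traffic control": {"controller", "radar", "runway"},
}


def classify(text: str, min_share: float = 0.5, max_subjects: int = 2) -> list:
    """Return the few subjects whose terms dominate the matched terms."""
    words = text.lower().split()
    counts = {s: sum(w in terms for w in words) for s, terms in PATTERNS.items()}
    total = sum(counts.values())
    if total == 0:
        return []  # leave the document unclassified rather than guess
    ranked = sorted(counts, key=counts.get, reverse=True)
    return [s for s in ranked if counts[s] / total >= min_share][:max_subjects]
```

    In this sketch a passing mention of radar in a text dominated by engine terms does not earn the
    "Air traffic control" subject, and a text with no matching terms is left unclassified, which is
    one simple way coverage can be traded for accuracy.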

    Below are two examples of taxonomy branches: one for aerospace technology and industry, the

    other for construction technology and industry.

    Sample Taxonomy: Aerospace

    The node identifier is the ID#, and the @ sign indicates inclusion by reference of other branches

    of the taxonomy.

    Aviation & space technology ID#18340

    Aerodynamics ID#18341

    Aeronautics ID#18342

    Flight control & navigation ID#18368

    Aeronomy ID#39560

    Aerospace communications equipment ID#18367

    Aerospace materials ID#18344

    Air traffic control ID#38332

    Aircraft design & construction ID#18346

    Aircraft engines & motors ID#18356

    Aerospace propulsion ID#18395

    Jet engines ID#18357

    Rocket engines ID#18359

    Aviation instrumentation ID#18382

    Commercial aircraft design ID#18347

    Flight simulators ID#17478

    Flight testing ID#18369

    Gliders ID#18351

    Helicopters ID#18352

    Homebuilts & ultralights ID#18349

    Hot air balloons ID#18350

    Landing gear ID#18383

    Military aircraft ID#18353

    Seaplanes ID#18354

    Small planes ID#18355

    Astronautics ID#18363

    Space systems ID#18412

    Astrophysics ID#18366

    @Celestial mechanics (ID#14437) ID#13928

    Aviation ground facilities ID#18372

    Airport planning & design ID#18373

    Military aircraft ground facilities ID#18375

    Spacecraft ground facilities ID#18377

    Avionics ID#37848

    Civil aviation ID#37768

    @Flight control & navigation (ID#14428) ID#18368

    History of aviation & space technology ID#18378

    History of aviation ID#18379

    History of space flight ID#18380

    @National Aeronautics & Space Administration (NASA) (ID#14427) ID#10135

    @Remote sensing (ID#14429) ID#13629

    Satellite technology ID#18402

    Communications satellites ID#18404

    Space stations ID#18411

    MIR space station ID#36642

    Space travel & exploration ID#18413

    Space colonization ID#18405

    Spacecraft & Space missions ID#18406

    Apollo space missions ID#18407

    Gemini space mission ID#18408

    Manned spacecraft ID#39606

    Mercury space missions ID#18409

    Space Shuttle ID#18410

    Space launch vehicles & equipment ID#39607

    Space probes ID#18389

    Space safety ID#39608

    Unmanned spacecraft ID#39609

    Viking mission to Mars ID#29129

    @Telescopes (ID#14432) ID#13198

    Sample Taxonomy: Construction

    Architectural engineering ID#18323

    @Architectural design (ID#40386) ID#358

    Building acoustics ID#18324

    @Construction engineering (ID#14424) ID#18511

    @Construction management (ID#14425) ID#18512

    Heating, ventilation & air conditioning ID#18331

    @Air conditioners & fans (ID#41168) ID#14698

    @Home furnaces (ID#43116) ID#37672

    Lighting & electrical systems ID#18333

    Commercial lighting ID#14840

    Exterior lighting ID#14767

    @Lighting design (ID#42055)

    @Structural engineering (ID#14426)

    Architectural services ID#4574

    Architectural drafting ID#4576

    House plans ID#37844

    @Landscape architecture (ID#40544)

    Lighting design ID#27266

    Asbestos ID#5305

    @Asbestos exposure (ID#40679)

    @Asbestos removal (ID#40570)

    Civil engineering ID#18509

    Bridge engineering ID#18510

    Construction engineering ID#18511

    Building standards & codes ID#39565

    Construction automation ID#18513

    Construction management ID#18512

    Construction safety ID#6194

    @Dams, canals & waterways (ID#14473)

    Earthworks engineering ID#18518

    Fire technology ID#28435

    Combustion & flammability ID#28439

    Fire investigation ID#28448

    Fire prevention ID#28450

    Fire safety systems ID#19295

    Fire suppression ID#28441

    Geotechnical engineering ID#18520

    Earthquake engineering ID#39468

    Geo-environmental systems ID#13643

    Geosynthetics ID#13644

    Hydraulic engineering ID#18524

    Coast & Harbor engineering ID#18525

    Flood control ID#18527

    @Hydraulic cement (ID#43096)

    @Hydraulic fluids (ID#43328)

    @Hydraulic machinery (ID#41488)

    Hydraulic structures ID#18528

    Aqueduct engineering ID#37887

    Dams, canals & waterways ID#18530

    Reservoir engineering ID#18532

    Irrigation & drainage ID#13032

    Sediment transport ID#18535

    Surface water runoff ID#18536

    Lighthouses ID#18540

    @Ocean engineering (ID#14472)

    Structural engineering ID#18547

    @Mechanical behavior of materials (ID#14474)

    Structural concrete ID#18551

    Structural steel ID#18554

    Surveying ID#18555

    @Geographic information systems (GIS) (ID#14475)

    Photogrammetry ID#18560

    @Remote sensing (ID#14476)

    Transportation engineering ID#18565

    @Automotive engineering (ID#41482)

    Electric vehicles ID#36914

    Emission control ID#18569

    High-speed ground transportation ID#18570

    Highways, roads & pavements ID#18522

    Intelligent transportation systems ID#18574

    Marine transportation ID#19488

    Pipeline transportation ID#39656

    Railroad engineering ID#18544

    Transportation planning ID#18575

    Transportation safety ID#39610

    @National Transportation Safety Board (NTSB) (ID#40938)

    @Urban transportation (ID#41339)

    Tunnel engineering ID#18577

    Construction industry ID#26320

    Building contractor services ID#4725

    Building materials ID#39453

    @Carpentry & woodworking (ID#41390)

    @Construction machinery (ID#41486)

    @Driveway coating & construction (ID#42066)

    @Electrician services (ID#41176)

    @Fences & stone walls (ID#40582)

    Hand & power tools ID#14756

    @Home improvement centers (ID#43309)

    @Insulation services (ID#41174)

    @Landscaping services (ID#41177)

    Nonresidential Construction ID#39628

    Plumbers & plumbing supplies ID#14751

    @Bathroom fixtures & accessories (ID#41169)

    @Pool construction & maintenance services (ID#41179)

    Residential construction ID#39449

    @Roofing services (ID#41180)

    @Septic systems (ID#42067)

    @Underwater construction & Habitats (ID#43333)

    @Water well drilling (ID#42047)

    Facilities management ID#6270

    Floor laying, refinishing & resurfacing ID#27352

    Heating & Ventilation industry ID#26835

    @Heating, ventilation & air conditioning (ID#41445)

    House painting & wall covering services ID#14789

    Industrial equipment & Heavy machinery industry ID#26341

    @Farm equipment & Supplies industry (ID#41904)

    @Manufacturing equipment & machinery (ID#41485)

    @Turbomachinery (ID#41508)

    Lighting industry ID#26349

    Electrician services ID#14784

    Electrical supplies ID#14744

    Electrical testing & inspection ID#18674

    Electrical wiring ID#18675

    @Lighting & electrical systems (ID#41446)

    @Lamps & light fixtures (ID#40577)

    @Lighting design (ID#42054)

    Laminated wood ID#27394

    Particle board ID#27398

    Plywood & veneer ID#5068

    Pressure treated wood ID#27401

    Sheet metal ID#5124

    Wire & Cable products ID#39463

    Aluminum & aluminum products ID#5112

    Copper & copper products ID#5116

    Iron industry ID#26357

    Steel industry ID#26358

    Stainless steel ID#5125

    Paint & paint supplies ID#14748

    Property developers ID#5232

    Rock mechanics ID#39569

    Soil science & technology ID#18312

    Erosion ID#18313

    Fertilizers ID#18316

    Chemical fertilizers ID#17611

    Organic fertilizer ID#17612

    Soil cultivation ID#18315

    Soil pollution ID#13029

    Soil remediation ID#37889

    Stone, clay, glass & concrete product industry ID#26373

    Cement ID#5306

    Hydraulic cement ID#36456

    @Ceramics & Pottery (ID#40405) ID#797

    Concrete ID#5307

    Concrete block & brick ID#36460

    Ready-mixed concrete ID#36461

    Cut stone & stone products ID#5308

    Granite ID#36462

    Limestone ID#36463

    Marble ID#36464

    @Memorials & Grave stones (ID#42060) ID#27319

    Earthenware ID#5309

    Glass products ID#5311

    Automobile glass ID#36478

    Flat glass ID#36465

    Glass containers ID#36466

    @Mirrors (ID#40565) ID#4879

    Pressed & blown glass ID#5315

    @Sand & Gravel (ID#41514) ID#19469

    Vitreous china ID#36475

    Windows & doors ID#4998

    Indexing

    During indexing, all of the terms in the documents and metadata fields are extracted and indexed

    into appropriate index structures; searches can be resolved by using these structures and without

    having to refer to the original documents. This process is exhaustive and comprehensive; all

    visible terms in the document display object and all appropriate metadata values are indexed.

    No stop words or special characters are excluded from indexing, and there is no practical cut-off

    in document length at which indexing stops.

    All search terms (including those inside quoted phrases) are viewed as nouns and transformed

    (stemmed) automatically to their common singular form during indexing (and at query time). This

    allows a search on a singular or plural noun to find occurrences of either. The stemming rules are

    fairly simple and do not handle most irregular forms, which tend to occur for very common words

    not generally useful in searching.
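
    The singular-form normalization described above can be illustrated with a minimal sketch. The suffix rules and the name to_singular are assumptions made for illustration; the white paper does not list the actual rules. As in the text, irregular plurals are deliberately left alone.

```python
def to_singular(term: str) -> str:
    """Reduce a term to a rough singular form with a few suffix rules.

    Hypothetical rules for illustration only; applied identically at
    indexing time and at query time so both sides agree.
    """
    word = term.lower()
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"      # "batteries" -> "battery"
    if len(word) > 4 and word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]            # "boxes" -> "box"
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]            # "satellites" -> "satellite"
    return word                     # irregular forms pass through unchanged


# A singular query term and a plural document term meet at the same key.
print(to_singular("satellites") == to_singular("satellite"))
```

    Because the same function runs on both documents and queries, an imperfect rule still produces consistent matches.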

    All terms are indexed as all lower-case letters. In addition, all terms containing at least one

    instance of both upper- and lower-case are also indexed in a special case-sensitive index. This

    allows queries to find all instances of a term regardless of case; query terms are also translated

    to all lower-case for initial query resolution. This also allows a case-sensitive match with a

    query term to be used as a relevancy factor.
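
    One way to picture the two indices is the sketch below. The class name CaseAwareIndex and its in-memory dictionaries are assumptions for illustration, not the product's actual index structures.

```python
from collections import defaultdict


class CaseAwareIndex:
    """Hypothetical two-part index: a lower-cased index plus a mixed-case index."""

    def __init__(self):
        self.folded = defaultdict(set)  # lower-cased term -> document ids
        self.exact = defaultdict(set)   # mixed-case term as written -> document ids

    def add(self, doc_id, text):
        for token in text.split():
            self.folded[token.lower()].add(doc_id)
            # Only terms with both upper- and lower-case letters go in the special index.
            if any(c.isupper() for c in token) and any(c.islower() for c in token):
                self.exact[token].add(doc_id)

    def search(self, term):
        """Return {doc_id: case_matched} for every document containing the term."""
        hits = self.folded.get(term.lower(), set())
        exact_hits = self.exact.get(term, set())
        return {doc_id: doc_id in exact_hits for doc_id in hits}


index = CaseAwareIndex()
index.add(1, "Northern Light portal")
index.add(2, "northern light at dusk")
print(index.search("Light"))
```

    Both documents are found regardless of case; the boolean flag is the extra signal a ranking step could use.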

    The above rules are English-language dependent. However, given the symmetry with which they

    are applied at indexing and query time, they generally preserve appropriate search processing for

    all languages. Double-byte language support, which requires the licensing of a third-party

    language processing module from Teragram, includes language-sensitive stemming and other

    sorts of processing.

    Numeric tokens (or tokens of mixed letters and numbers) are indexed as text. Proximity

    information can be represented in indices in various ways, allowing either very fast access of

    short phrases, or complete and precise (but slower) access of phrases of unlimited length.
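
    The "complete and precise" end of that trade-off can be sketched as a positional index, in which every token records where it occurs. The function names and the dictionary layout are assumptions for illustration; the white paper does not describe the real index format.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: [positions]}. Numeric tokens are kept as text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the tokens of `phrase` consecutively."""
    tokens = phrase.lower().split()
    if not tokens or any(t not in index for t in tokens):
        return set()
    results = set()
    for doc_id, starts in index[tokens[0]].items():
        for start in starts:
            # Every following token must sit exactly i positions after the first.
            if all(start + i in index[t].get(doc_id, ()) for i, t in enumerate(tokens)):
                results.add(doc_id)
                break
    return results


docs = {1: "boeing 747 engine failure", 2: "engine for the 747"}
index = build_positional_index(docs)
print(phrase_search(index, "747 engine"))
```

    Checking positions works for phrases of any length but costs more per query, which is the slower path the text mentions.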

    Query Service

    Finished databases are connected to Northern Light's network by the Query

    Listener (QL). The QL accepts queries from external clients (such as a Web server) and passes

    them to the Query Server (QS). The QS translates the search syntax and other parameters,

    queries the database indices, and returns the appropriate citation information and metadata. The

    Query Listener is also responsible for identifying itself by broadcasting, via UDP, a database

    identifier and load information. Clients use this information to select a listener appropriate to their

    mission.
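
    The client side of this selection can be sketched as follows. The JSON payload layout, the field names, and the "lowest load wins" rule are all assumptions for illustration; the white paper states only that a database identifier and load information are broadcast over UDP. In a live system the datagrams would arrive from a UDP socket rather than from a list.

```python
import json


def encode_announcement(database_id, host, port, load):
    """Build a hypothetical announcement datagram for a Query Listener."""
    payload = {"db": database_id, "host": host, "port": port, "load": load}
    return json.dumps(payload).encode("utf-8")


def pick_listener(datagrams, database_id):
    """Choose the least-loaded listener that serves the requested database."""
    announcements = [json.loads(d.decode("utf-8")) for d in datagrams]
    candidates = [a for a in announcements if a["db"] == database_id]
    if not candidates:
        return None
    best = min(candidates, key=lambda a: a["load"])
    return best["host"], best["port"]


received = [
    encode_announcement("market-research", "10.0.0.1", 9000, load=0.80),
    encode_announcement("market-research", "10.0.0.2", 9000, load=0.25),
    encode_announcement("news", "10.0.0.3", 9000, load=0.05),
]
print(pick_listener(received, "market-research"))
```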

    Searching

    Nearly all search fields on all search forms accept and process searches in the same way. The

    query interpretation algorithm proceeds as follows:

    1. If the query is a well-formed Boolean expression, it is rigorously interpreted as such.

    Boolean expressions can contain AND, OR, NOT, simple terms (words), quoted

    phrases, wildcards and parentheses including sub-expressions, and may contain an

    unlimited amount of nesting. In addition, a Boolean expression may itself contain any

    number of fielded sub-expressions that specify a search against a particular metadata

    field, e.g., (lawsuit or sue) and title: microsoft or netscape. By default, terms are

    searched against the text field, which includes all full text and all document metadata.

    This field may also be specified by use of the text: keyword. Search terms may also

    include one or more trailing or multi-character wildcards (indicated by *) or single-

    character wildcards (%) as long as there are at least four non-wildcard characters before

    the first wildcard, e.g., rachm%inof*. Search fields are available to the user through

    appropriate fields on search forms or through the field: syntax. Other special fields

    include relational operators that can be used for date fields, sort: date (a reverse

    chronological sort), or sort: relevancy (the default).

    2. If a well-formed Boolean expression is not found and the query is more than a specified

    length (currently 12 terms), a statistical query evaluation process is used. This only

    requires the presence of a single term in the query for a document to appear on the

    results list, but makes use of all terms or phrases appearing in a document to determine

    the best documents. This statistical evaluation can also be forced on a query of any

    length by preceding the query with the pseudo-field like:, e.g., like: side effects of anti-

    depressants and sedatives

    3. If neither of the above two conditions is met, then a query with any use of the + or -

    operators common among Internet search engines will be interpreted according to

    generally accepted rules. The rules are that any term or quoted phrase immediately

    preceded by a + must be in a document to appear on the results list, and any term or

    quoted phrase preceded by a - cannot be in any document for it to appear on the results

    list. Other terms in the query are considered desirable but not required.

    4. If the query does not meet any of the above criteria, a fuzzy search is performed. This

    does an implicit AND of most content-bearing words (or what are generally non-content

    bearing words if those are the only query terms) but uses all terms entered for relevancy

    ranking purposes. Some limited natural language analysis is also performed on terms,

    such as recognition of the word "not".
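
    The four-step decision above can be sketched as a small dispatcher. The function names and the Boolean well-formedness check are simplifying assumptions: the check only verifies that operands and AND/OR/NOT alternate, that parentheses balance, and that at least one operator is present. Testing the like: prefix first is also an assumption, made because the text says it forces statistical evaluation on any query.

```python
import re

# Hypothetical tokenizer: quoted phrases, parentheses, or runs of other characters.
TOKEN = re.compile(r'"[^"]*"|\(|\)|[^\s()"]+')
OPERATORS = {"and", "or", "not"}


def is_boolean(query):
    """Heuristic well-formedness check (an assumption, not the product's parser)."""
    tokens = [t for t in TOKEN.findall(query) if not t.endswith(":")]  # skip field prefixes
    depth, expect_operand, saw_operator = 0, True, False
    for tok in tokens:
        low = tok.lower()
        if expect_operand:
            if low == "not":
                saw_operator = True
            elif tok == "(":
                depth += 1
            elif tok == ")" or low in OPERATORS:
                return False
            else:
                expect_operand = False
        elif low in OPERATORS:
            expect_operand, saw_operator = True, True
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False  # two operands in a row
    return saw_operator and not expect_operand and depth == 0


def classify_query(query, long_query_terms=12):
    """Return which of the four interpretation modes would handle the query."""
    stripped = query.strip()
    if stripped.lower().startswith("like:"):
        return "statistical"
    if is_boolean(stripped):
        return "boolean"
    terms = stripped.split()
    if len(terms) > long_query_terms:
        return "statistical"
    if any(len(t) > 1 and t[0] in "+-" for t in terms):
        return "plus_minus"
    return "fuzzy"


print(classify_query("(lawsuit or sue) and title: microsoft or netscape"))
```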

    All query terms are presumed to be nouns and are translated, if necessary, to their singular form

    using fairly simple algorithms. This allows a match against either a singular or plural form, since

    all document terms are similarly converted to singular form during indexing. Query terms are also

    translated to lower case in order to be able to match any form of the word in any document; all

    document terms are similarly converted to lower-case at indexing time. Mixed case terms are

    searched against a special mixed case index to provide information about case-sensitive matches

    for relevancy ranking.
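
    The wildcard rule from step 1 (at least four non-wildcard characters before the first wildcard) can be sketched as a translation to a regular expression. The function name is hypothetical, and treating * as "zero or more characters" and % as "exactly one character" is an assumption about the exact semantics.

```python
import re


def wildcard_to_regex(term, min_prefix=4):
    """Translate a term using * and % wildcards into a compiled regex.

    Raises ValueError when fewer than `min_prefix` literal characters
    precede the first wildcard.
    """
    first = next((i for i, c in enumerate(term) if c in "*%"), None)
    if first is not None and first < min_prefix:
        raise ValueError(f"need at least {min_prefix} characters before the first wildcard")
    parts = []
    for c in term.lower():
        if c == "*":
            parts.append(".*")   # assumed: zero or more characters
        elif c == "%":
            parts.append(".")    # assumed: exactly one character
        else:
            parts.append(re.escape(c))
    return re.compile("".join(parts))


pattern = wildcard_to_regex("organi%ation*")
print([w for w in ("organization", "organisations", "organic") if pattern.fullmatch(w)])
```

    The prefix requirement keeps the set of candidate index terms small, since every expansion must share the leading literal characters.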

    Relevancy Ranking

    One of the strengths of Northern Light's technology is its advanced relevancy ranking algorithms.

    These not only provide a novel approach to ranking but are based on highly optimized index

    structures and algorithms that allow Northern Light Search and Content Integration to perform

    significant relevancy ranking operations on a very large database.

    Ranking takes into account several different factors, each of which contributes weight to a

    document's overall relevancy score and to its eventual placement in the results list. A maximum

    theoretical relevancy score is calculated for every query, and displayed relevancy scores

    represent a simple transformation to a 1-99% range of the actual document score as compared to

    the maximum theoretical score. These factors include the following:

    Number of occurrences of matching terms (term frequency factor, or TF).

    Relative frequency of those terms in the entire database (term inverse document frequency, or IDF).

    Implicit phrase recognition

    Location of matching terms and phrases in document metadata

    Number and authority of external sites linking to this document (applies to Web documents only)

    Date of the document (all other things being equal, more recent documents are considered more relevant than older documents)

    Classification metadata

    Document length

    Presence of any detectable spam
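
    The first two factors and the displayed percentage can be sketched as below. The weighting formula is a textbook TF-IDF variant and the clamping to 1-99 is one plausible reading of the "simple transformation"; both are assumptions, since the white paper does not give the actual formulas or how the other factors are combined.

```python
import math


def tf_idf_score(query_terms, doc_tokens, doc_freq, total_docs):
    """Sum TF x IDF over the query terms (illustrative formula, not the product's)."""
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)   # occurrences in this document
        df = doc_freq.get(term, 0)    # documents in the database containing the term
        if tf and df:
            score += tf * math.log(1 + total_docs / df)
    return score


def displayed_relevancy(score, max_score):
    """Map a raw score to a 1-99 percentage of the maximum theoretical score."""
    if max_score <= 0:
        return 1
    return max(1, min(99, round(100 * score / max_score)))


tokens = ["apollo", "mission", "mission"]
doc_freq = {"apollo": 10, "mission": 1000}
rare = tf_idf_score(["apollo"], tokens, doc_freq, total_docs=10000)
common = tf_idf_score(["mission"], tokens, doc_freq, total_docs=10000)
print(rare > common, displayed_relevancy(rare, rare + common))
```

    A single occurrence of a rare term outweighs two occurrences of a common one, which is the effect the IDF factor is meant to capture.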

    Custom Search Folders

    For any result set containing more than 25 results, Northern Light determines a set of Custom

    Search Folders (CSFs) before returning the results to the user. To do this, Northern Light

    examines the metadata values of the documents on the results list, uses those values to

    determine appropriate CSFs, weighs each of the CSFs and then displays the top-weighted CSFs.

    Weighting of CSFs is determined by a number of rules that contribute different values to the

    overall weighting of that CSF. Certain rules assign weights based on how many documents are in

    the CSF, or how many of its documents rank high on the results list. Other rules assign values

    based on how different the CSFs are from other candidate CSFs, or based on the more exact

    nature of the metadata values involved.
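
    A minimal sketch of two of these rules follows: one point per document in a candidate folder, plus a bonus that shrinks with the document's position on the results list. The function name, the numeric weights, and the default of three folders are assumptions for illustration; only the 25-result threshold comes from the text.

```python
from collections import defaultdict


def custom_search_folders(results, min_results=25, top_n=3):
    """Pick top-weighted folders from a ranked result list.

    `results` is a list of metadata-value lists, best-ranked document first.
    """
    if len(results) <= min_results:
        return []  # folders are built only for result sets of more than 25
    weights = defaultdict(float)
    for rank, values in enumerate(results):
        for value in set(values):
            weights[value] += 1.0               # rule: number of documents in the folder
            weights[value] += 1.0 / (rank + 1)  # rule: highly ranked documents count more
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


results = [["Space stations"]] * 20 + [["Satellite technology"]] * 10
print(custom_search_folders(results, top_n=2))
```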

    Session Management

    Northern Light's service uses a proprietary session state management system to

    store user state between page requests or other transactions. Each user is given a cookie with a

    unique token that contains no outwardly useful information. The user's browser transmits the

    token in the headers of each request, and the Northern Light software uses the token to retrieve

    session data.
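
    The idea of an opaque token that only keys into server-side state can be sketched as follows. The class SessionStore and its in-memory dictionary are assumptions for illustration; the actual system is proprietary and not described in detail.

```python
import secrets


class SessionStore:
    """Hypothetical server-side store keyed by a random, meaningless token."""

    def __init__(self):
        self._sessions = {}

    def create(self):
        # The token carries no user data; it is only a lookup key sent as a cookie.
        token = secrets.token_urlsafe(32)
        self._sessions[token] = {}
        return token

    def load(self, token):
        """Return the session data for a token, or None if the token is unknown."""
        return self._sessions.get(token)


store = SessionStore()
cookie_value = store.create()
store.load(cookie_value)["last_query"] = "space stations"
print(store.load(cookie_value))
```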

    Security

    Northern Light can invoke a variety of security solutions at the option of the customer. If a

    username/password scheme is desired, we provide an administrative user interface for managing the

    passwords. IP validation can be used in lieu of username/password security, or in addition to it.

    Secure https protocols are customarily used with Verisign certificates ensuring the validity of

    connections, or leased T1 lines can be used for extreme security. In our 7-year history, Northern

    Light has never experienced a single hacker intrusion into customer applications or our network.
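
    IP validation of the kind mentioned above can be sketched with an allow-list of customer networks. The function name and the example address ranges are assumptions for illustration; the white paper does not describe how the check is implemented.

```python
import ipaddress


def ip_allowed(client_ip, allowed_networks):
    """Return True when the client address falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net) for net in allowed_networks)


# Hypothetical customer range taken from the documentation address blocks.
customer_networks = ["192.0.2.0/24"]
print(ip_allowed("192.0.2.17", customer_networks))
```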

    Alerts

    Northern Light offers users and enterprise customers the ability to save any search and have it

    run automatically whenever the database referenced by the search is updated. At that time, an e-

    mail is sent to the registered owner of the alert if and only if there are new documents in the

    database that meet the search criteria; the e-mail message contains a link to just these new

    results. The system further keeps track of when a user has actually accessed these new results

    so that, if a user receives a string of alert e-mails before being able to see any of them, accessing

    any one of them will provide all the new results since the user's last access; it is unnecessary to

    cycle through the alert messages one at a time in order to see all new results.
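
    The alert behavior described above can be sketched as follows. The class Alert, its method names, and the use of a simple predicate in place of a saved search are assumptions for illustration.

```python
class Alert:
    """Hypothetical saved-search alert that accumulates unseen results."""

    def __init__(self, matches):
        self.matches = matches  # predicate standing in for the saved search
        self.unseen = []        # new matching documents not yet viewed by the owner

    def on_database_update(self, new_docs):
        """Run the saved search on new documents; return True if an e-mail is due."""
        hits = [doc for doc in new_docs if self.matches(doc)]
        self.unseen.extend(hits)
        return bool(hits)       # e-mail is sent if and only if there are new matches

    def open_results(self):
        """Following any alert link shows everything new since the last access."""
        results, self.unseen = self.unseen, []
        return results


alert = Alert(lambda doc: "mars" in doc)
alert.on_database_update(["viking mars probe", "fire safety"])
alert.on_database_update(["mars rover"])
print(alert.open_results())
```

    Two alert e-mails were generated, yet a single visit returns both new documents and clears the backlog.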

    Applications Development and Hosting

    Northern Light offers complete services for producing, or letting customers produce, custom

    search applications to be run either in ASP mode, in-house at a customer site, or some

    combination of the two. These can include documented APIs for searching, customized results

    lists, alerts and other capabilities (usually through XML interfaces). Northern Light also has a

    dedicated applications group, efficient tools for rapid development of user interfaces and, behind

    it all, a 7x24 secure operations facility.

    Northern Light Technology Awards

    Top 100, eContent magazine, December 2001

    Best of the Web, US News and World Report, October 2001

    Top 100 Companies That Matter, KMWorld magazine, September 2001

    Editors' Choice, PC Magazine, November 2000

    Best of the Web, Forbes magazine, September 2000

    Web Business Award for Online Excellence, CIO magazine, July 2000

    Best Online Business/Professional Service, Software & Information Industry Association, March 2000

    Best Online Research Product, Software & Information Industry Association, March 2000

    Best Online Information Service, Software & Information Industry Association, March 2000

    Editors' Choice, PC Magazine, September 1999, 1998, 1997