
FOI/Crime/133

PND Facial Search Accuracy

Evaluation

HO Centre for Applied Science and Technology

Author: G Whitaker

V1.0 23rd October 2015


OFFICIAL – SENSITIVE


Contents

PND Facial Search Accuracy Evaluation

Executive Summary

1 Introduction
2 Aim of the evaluation
3 Scope
5 Evaluation Objectives
6 Evaluation Approach
  6.1 Evaluation Methodology
  6.2 Test dataset compilation
  6.3 Matching Thresholds
  6.4 Searching (Stage 1)
  6.5 Results Recording
  6.6 Searching (Stage 2)
  6.7 Results analysis and reporting
7 Results Analysis
  7.1 Probe Image Statistics
  7.2 Search Results – Stage 1
  7.3 Search Results – Stage 2
  7.4 Results of the Automated Analysis
  7.5 Manual Analysis of Results
  7.6 Interpretation of Results
8 Conclusions and Recommendations
  8.1 General
  8.2 Image Quality
  8.3 Comparison Thresholds and Candidate List Length
  8.4 Multiple Match Groups and Duplicate Images

Annex A – An example of the probe image data spreadsheet
Annex B – An example of the results analysis spreadsheet


Executive Summary

This paper describes the approach explored by HO CAST to attempt an assessment of the performance (on the operational system) of the Facial Search capability that was implemented on the Police National Database (PND) in 2013.

Testing on a 'live' system in this way is a significant challenge for reasons explained in this paper, and was only undertaken because no other formal accuracy testing had been carried out as part of the implementation. While the results obtained with regard to search accuracy are of limited value, they nonetheless provide an insight into the way facial images are currently stored on PND and their suitability for use with automated facial recognition algorithms.

One of the key drivers for this work was concern over the lack of testing prior to operational deployment of the facial search capability. However, it is equally important not to lose sight of the fact that significant operational successes are now being reported where it has been used as part of police investigations. Moving forward, the emphasis should be on ensuring that any future developments of the facial matching capability (indeed, government and law enforcement biometric services generally) take note of the recommendations in this report.

Key Findings

- It is essential that performance assessment of biometric systems be undertaken prior to operational deployment, and the requirements governing this must be adequately covered within the contractual arrangements with the supplier. A 'lessons learned' exercise should be undertaken to understand why such formal testing was not included within the scope of this deployment. Accuracy assurance should be embedded as an essential requirement in the development and implementation of future systems that have a biometric component.

- This evaluation has revealed significant facial image duplication within the PND, potentially impeding operational effectiveness and incurring additional cost. Since PND acts as a central repository, holding and sharing copies of information that is already stored locally, it may be assumed that such duplication is also occurring within individual forces and that similar additional costs are being incurred at a local level. Alternative approaches to the storage and management of custody image data should be explored; for example, a centralised solution such as is currently in place for fingerprint and DNA data would improve data quality and integrity, and might also be expected to be more cost effective than continuing to maintain separate custody image repositories within each force.

- The evaluation has revealed significant variation between forces in the quality of facial images submitted to PND (assuming the image files being uploaded are the same as those being held locally). In general, forces are failing to ensure that their custody images comply with previously issued national standards, both in terms of image quality and associated metadata. The introduction of suitable (automated) facial image quality checking software is recommended, both at the point of capture within individual forces and at the point of enrolment into the PND facial search database.

- This evaluation has not revealed any significant issues with the way the FR algorithm is working or with the matching threshold settings as currently applied. However, the effect of the recent change in the threshold from 0.7 to 0.65 should be carefully monitored to determine what benefits, if any, it has brought and whether further reductions may be beneficial.


1 Introduction

This paper describes the approach explored by HO CAST to attempt an assessment of the performance (on the operational system) of the Facial Search [1] capability that was implemented on the Police National Database (PND) in 2013.

There is no large scale test environment for the PND facial search capability, and no formal evaluation of search performance was carried out on the operational system prior to 'go-live'. Evaluating biometric search accuracy on a live system, where previously ground-truthed test data cannot be introduced into the database (and subsequently removed), is a significant challenge.

This has necessitated an alternative approach to performance assessment. Note that this should not be considered 'best practice' for this type of testing; but in the absence of any additional funding or contractual obligation on the Systems Integrator (SI) to carry out such testing on a dedicated test environment, it should as a minimum provide insights into the operation of the facial search capability as currently implemented on PND.

2 Aim of the evaluation

Although a facial search capability has been live on PND for more than a year, there has to date been no objective assessment of its performance. The aim of this evaluation was to provide quantitative data on the performance of the Facial Recognition (FR) algorithm as currently installed on the operational system when typical custody-quality photographs are used as enquiry images. It was also an opportunity to better understand how facial images are uploaded and stored on PND, and to identify areas where improvements might be made.

Although the scale of this test was limited (due to the lack of any batch capability on PND), this information will be of value in informing further development of PND as well as other large scale (national) HO deployments of FR technologies. The results will also be helpful in informing policy development for future uses of facial images (capture, storage, and searching).

3 Scope

This evaluation is primarily concerned with an assessment of the search functionality, following the enrolment of face images. Assurance of the enrolment process was addressed as part of the release testing activities for Rel18.

However, the evaluation also provides an opportunity to look at variations in image quality between forces, to investigate how images are uploaded and stored on PND, and to highlight those areas where changes might be made which would lead to improvements in the performance of the facial search capability.

[1] 'Facial Search' is the term commonly used to describe the use of a face recognition algorithm to search the custody images on PND. Within this document, Facial Search is used interchangeably with the term Face Recognition or 'FR'.

Excluded from the scope were operator training and expertise (which can have a major impact on end-to-end operational accuracy) and the use of CCTV, Facebook or other similar 'non-compliant' facial images. The evaluation only looked at the searching of custody images against a database of custody images, due to the difficulties in obtaining sufficient numbers of other types of imagery with known matches in the database.

5 Evaluation Objectives

The overarching objective of this work was to explore a methodology that could be used on the live system to:

1) validate that the capability has been implemented correctly by the supplier;

2) provide quantitative data to inform the setting of matching thresholds, to maximise the likelihood of a match being found (if one exists in the database) whilst minimising the total number of search results that the operator has to view;

3) provide a baseline figure for the search accuracy of automated facial recognition technology when used to search good quality enquiry images against a database of 13M plus operational custody images;

4) learn more about the way in which custody images are uploaded and stored on PND, and identify areas where improvements might be made;

5) inform the user community about system capability, providing information that will help with formulating guidance and best practice for the use of this capability, as well as future national deployments of FR.

6 Evaluation Approach

In the absence of a dedicated large scale test environment for the facial search capability, two major challenges in developing a methodology to meet these objectives were:

- the lack of any 'ground truth' for the operational images on PND, coupled with an inability to 'seed' test data [2] into what is now a live operational database; and

- obtaining sufficient numbers of enquiry images, representative of operational data and with known matches in the database.

[2] Mixing test data with operational data is not considered good practice, as the test data could be returned in response to an operational search. Such testing should always be conducted on a dedicated test environment.


The test approach described here attempts to overcome these challenges by making use of a 'window of opportunity' each month, during which new images have been uploaded to PND by forces but have not yet been enrolled into the FR system.

6.1 Evaluation Methodology

New images are routinely uploaded to PND from local force custody imaging systems, and these are then linked to the relevant POLE (Person, Object, Location, Event) data for the subject (normally, but not necessarily, a new arrestee). The diagram below shows a high-level outline of the process.

After uploading to PND such images will appear in the results list of a POLE search, but as they are only enrolled into the automated FR database on a monthly basis (being held in a 'pool' of newly uploaded images until that time), they will not initially be searchable by the FR algorithm.

These images therefore constitute a good source of enquiry (probe) data: operational custody images that are not yet in the FR database and therefore cannot match against themselves, but a proportion of which will be of subjects who already have images in the database (i.e. from previous arrests).

Unfortunately there is no easy way of determining which of the images are of first-time offenders and which are of recidivists (who probably have other images from previous arrests already in the database). For all searches that return a recognisable 'match' this is not an issue, but for those that do not, it is difficult to say whether a match existed but was not found by the algorithm, or whether there was no matching image present in the database.

To address this it is necessary to repeat these searches at a later date, once the images have all been enrolled in the FR system. This is explained in more detail later in this report.

6.2 Test dataset compilation

Testing of biometric systems requires the use of personal biometric data from real subjects; it is not possible to synthesise operationally representative test data. Prior to starting work on creating the test datasets, confirmation was obtained from the Information Commissioner's Office that it would be acceptable to use un-enrolled operational images from police forces as probes, as described in this test approach.

Operational facial images were then sourced from the 'pool' of images recently uploaded to PND by individual forces. CGI (the systems integrator) was requested to extract a randomised subset of such images (approximately 2000), which were then transferred via FTP to the PND 'secure area' at HO headquarters, where they were loaded onto a dedicated secure laptop with access to PND via the 'Restricted' channel.

After reviewing the data it was found that CGI's extraction process had corrupted the images, rendering them unusable, and a request was made for them to be re-sent. When the second set was examined, it was discovered that the overwhelming majority appeared to have originated from a single force. The original requirement was that this data should be an operationally representative sample, so CGI were asked to extract a second batch of 2000 images; this proved to be more diverse, although still drawn from only a small number of forces.

From these 4000 images a smaller test sample was selected, chosen to be representative of the overall data in terms of both image quality (size, resolution, pose, illumination, expression etc.) and demographics (age, gender, ethnicity). This information was recorded on a spreadsheet (see Annex A).

6.3 Matching Thresholds

For current operational use the matching threshold for the facial recognition algorithm is set at the default level recommended by the supplier (Cognitec). For testing purposes, however, it is desirable to set this to a lower level in order to ensure that a sufficient number of candidate responses are always returned from a search.

Therefore, after consultation with members of the PND team and the Systems Integrator, prior to the start of the test CGI were requested to lower the matching threshold from the default setting of 0.7 to 0.5. It was subsequently discovered that CGI had made the threshold change on the 'Confidential' channel and not the 'Restricted' channel. This was not corrected until the second day, resulting in the first day's searches not returning the expected results and thus limiting their value in this evaluation.
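The effect of lowering the threshold can be sketched as follows. This is an illustration only, not the Cognitec API: the candidate scores and the filter_candidates helper are invented for this example.

```python
# Sketch: how a matching threshold trims a search's candidate list.
# Scores and the helper function are hypothetical, for illustration only.

def filter_candidates(candidates, threshold):
    """Keep only candidates scoring at or above the threshold,
    ordered highest score first (as a respondent list would be)."""
    return sorted(
        (c for c in candidates if c[1] >= threshold),
        key=lambda c: c[1],
        reverse=True,
    )

# Hypothetical comparison scores for one probe against the database.
scores = [("MG-001", 0.93), ("MG-002", 0.55), ("MG-003", 0.71), ("MG-004", 0.62)]

print(filter_candidates(scores, 0.7))  # at the operational default, 2 candidates
print(filter_candidates(scores, 0.5))  # at the lowered test threshold, all 4
```

At 0.7 only two of the four hypothetical candidates survive; at 0.5 all are returned, which is why the lower setting guarantees that every test search produces results to record.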

6.4 Searching (Stage 1)

There is currently no batch capability for facial search on PND. Each image from the chosen test set therefore had to be manually selected as a probe/enquiry image and a search launched. The search was based solely on the image, with no entry of additional search criteria (such as age range), and was run against all 13M plus images in the FR database.

Note that no additional image processing (cropping, rotation, sharpening etc.) was applied to the probe images. Provided an image met the criteria of size (<500KB) and of both eyes being visible on human inspection, it was submitted for searching in the 'as received' condition.
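The acceptance criteria above can be expressed as a simple pre-submission check. This sketch is hypothetical: the function name is invented, and the eyes-visible criterion, which was a manual human judgement in the evaluation, is passed in as a flag rather than detected automatically.

```python
MAX_BYTES = 500 * 1024  # the <500KB size criterion stated above

def probe_acceptable(size_bytes, eyes_visible):
    """Return True if a probe image meets the stated submission criteria:
    file size under 500KB, and both eyes visible (a manual judgement in
    the evaluation, represented here as a boolean flag)."""
    return size_bytes < MAX_BYTES and eyes_visible

print(probe_acceptable(7 * 1024, True))    # smallest test image: accepted
print(probe_acceptable(600 * 1024, True))  # over the size limit: rejected
```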

Throughout the testing there were performance problems with the connection from the secure laptop to PND, which would regularly freeze and require a restart. It was not possible to identify the cause, but as a result the number of searches that could be launched in the time available was significantly reduced.


6.5 Results Recording

The results of all 'successful' searches were recorded, success being defined as a completed search that returned one or more results above the threshold. With a very low threshold in place for the test, all searches should return results, which could then be viewed online as well as being recorded in the electronic 'results log' provided by CGI after all the searches were completed.

Note - Match Groups

PND assigns data relating to the same individual to a 'match group'. This also applies to images where there is sufficient additional metadata to link the image to an existing match group. In many cases, however, there is not sufficient information in the record uploaded by the force, and as a result images relating to a particular individual (in some cases possibly the exact same image) may be allocated to different or even new match groups.

The effect of this is that the same individual may be returned in multiple positions in the respondent list. However, only one of these match groups will contain a copy of the probe image, and it is the position and score of this match group that matters when recording the result of the search.

This may result in some true matches being reported as 'misses', but this is an inevitable consequence of the way PND uses match groups, as well as of not having any reliable ground truth against which to assess the result.

6.6 Searching (Stage 2)

After all the Stage 1 searches had completed, CGI were asked to enrol the probe images into the FR database, and the searches were then resubmitted. This time round, each probe was expected to match against itself, returning a score of '1' (the maximum score), along with details of the match group to which the image had just been added.

In effect, this second search established the 'ground truth' for the data, although as stated in the note above this may not be 100% reliable due to the way match groups are used on PND.

6.7 Results analysis and reporting

Once both sets of searches had been completed (Stages 1 and 2), CGI were asked to provide the results spreadsheets, listing the Search ID, Match Groups and Image Scores for each search image, both before and after the images were enrolled into the FR system. These were then cross-referenced (using the Search IDs) with the author's Enquiry Image spreadsheet, linking each probe image to the results obtained both before enrolment (the genuine search result) and after (the 'ground truth').

For each probe image, the match group containing the highest scoring image returned by the first search was compared with that from the second search; if they were the same, the result was declared to be a match. The score for the second search was expected to always be '1' (as the image will have matched against itself), while the score from the first search was recorded as the genuine match score for that search.
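A minimal sketch of this cross-referencing step, assuming per-search records keyed by Search ID. The field layout and the example values are invented for illustration; the real CGI spreadsheets are not reproduced in this report.

```python
# Sketch: join Stage 1 and Stage 2 results on Search ID and compare the
# top-scoring match group from each stage. Record layout is hypothetical.

def compare_stages(stage1, stage2):
    """stage1/stage2 map search_id -> (top_match_group, top_score).
    Returns search_id -> dict with the declared outcome and genuine score."""
    results = {}
    for search_id, (mg1, score1) in stage1.items():
        if search_id not in stage2:
            continue  # no Stage 2 result to compare against
        mg2, _ = stage2[search_id]
        results[search_id] = {
            "match": mg1 == mg2,      # same top match group in both stages
            "genuine_score": score1,  # Stage 1 score is the genuine match score
        }
    return results

# Hypothetical records: S001 finds the same match group both times (a match);
# S002's Stage 2 self-match lands in a different match group (a miss).
stage1 = {"S001": ("MG-17", 0.82), "S002": ("MG-03", 0.61)}
stage2 = {"S001": ("MG-17", 1.0), "S002": ("MG-99", 1.0)}
print(compare_stages(stage1, stage2))
```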


If the match groups were not the same, there are two possibilities:

1. At the time of the first search there was no matching image in the database to be found. After enrolment, the image was assigned to a newly created match group. This result is therefore considered to be a genuine non-match, as there was nothing for the probe to match against the first time round. [3]

2. There is more than one match group containing images of the probe subject, and the one returning the highest scoring response differs from the one to which the image was actually added. If these match groups genuinely relate to the same individual, the result should be recorded as a match, but this information will not be evident solely from the results spreadsheets.

In both of the above cases it was acknowledged that a manual check of the results would be required to determine what had happened; depending on the numbers, it might only be possible to check a subset of the searches.

7 Results Analysis

7.1 Probe Image Statistics

Before considering the search results it is useful to consider the breakdown of probe images by subject gender, ethnicity, file size, quality etc., as this provides an insight into the type of data currently being uploaded by forces to PND.

Distribution by Ethnicity and Gender:

                            Number of Images    Percentage
  Male Caucasian                  124               58
  Female Caucasian                 50               24
  Male non-Caucasian               20               9.5
  Female non-Caucasian             18               8.5
  Total number of images          212              100%

[3] Note that on the operational service, whether this would be reported as a true non-match or a false match will depend on where the threshold is set. For this evaluation the threshold was lowered to ensure that all searches would return results.


Distribution by file size:

Image dimensions and file sizes varied considerably between forces. The smallest file size included in the test was 7KB, while the largest was 185KB. The smallest images were 200 by 200 pixels and the largest were 1280 by 1024.

[Chart: probe image file size distribution (in KB)]

[Chart: probe image size distribution (number of pixels x 1000, in bands from <50 to >500); one band annotated '600 x 480 pixels']


Other statistics:

                                Number of Images    Percentage
  Subjects wearing glasses              8               4%
  Subjects with facial hair            39              18%
  Non-compliant pose                   15               7.5%

Estimated age distribution:

  Juvenile (<18)          4     2%
  Young (18-35)          93    44%
  Middle aged (35-60)    98    46%
  Old (>60)              17     8%

Contrary to expectations, obtaining a compliant pose was not a major issue in most cases: subjects were generally front facing and with a neutral expression. Approximately 7% of cases exhibited significant deviation from this; typically these were images where the eyes were closed or the mouth was open. In a few cases, cuts or bruising was also evident on the face.

Lighting was an issue in just over a third of the images. Typically this was due to uneven illumination across the image, resulting in strong shadows on parts of the face. In other cases it appeared that no special lighting had been employed and that the only illumination was from standard fluorescent strip lights.

Image quality is known to be the biggest single factor influencing the performance of FR algorithms. While the above figures on file size, lighting etc. are useful indicators of quality, they are not sufficient in themselves. For example, in many of the larger images (which might have been thought to be of better quality) the subject often occupied only a very small part of the frame; if these were cropped along the lines of a conventional custody image, the actual facial image size would be towards the lower end of the size scale.

While many of the images initially appeared acceptable on the laptop screen, their shortcomings became very apparent when enlarged. Many were very 'soft', either due to the faces being out of focus or perhaps as a result of poor quality or dirty camera lenses. A significant proportion also exhibited JPEG artefacts resulting from high levels of data compression.

The following chart shows the image quality as judged (subjectively) by the testers. Note that an image fully complying with the existing police image standard would score 10; NONE of the probe images used in this test fully achieved this level of quality.


7.2 Search Results – Stage 1

Total number of searches launched: 212
Number of searches not returning any result for Stage 1 [4]: 10
Number of searches returning any result for Stage 1: 202
Number of searches returning a result in Stage 1 with a score of less than 1: 134
Number of searches returning a result in Stage 1 with a score of 1: 68

A key assumption of the test was that images returning a match with a perfect score of '1' must have matched against an exact duplicate image previously enrolled in the FR database. Such images cannot be used as part of an accuracy evaluation and so were excluded from the subsequent performance analysis.
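The exclusion step described above amounts to partitioning the Stage 1 results on their top score. A brief sketch, using hypothetical (probe ID, top score) pairs rather than the real results log:

```python
# Sketch: split Stage 1 results into usable probes and presumed duplicates,
# on the assumption that a perfect score of 1 means an exact duplicate was
# already enrolled. The (probe_id, top_score) pairs are hypothetical.

stage1_results = [("P1", 1.0), ("P2", 0.74), ("P3", 1.0), ("P4", 0.88)]

usable = [(pid, s) for pid, s in stage1_results if s < 1.0]
duplicates = [pid for pid, s in stage1_results if s == 1.0]

print(usable)      # probes retained for the accuracy analysis
print(duplicates)  # probes excluded as presumed duplicates
```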

7.3 Search Results – Stage 2

Following enrolment into the FR database, it was expected that when the probe images were launched a second time for a search against the database, they would all match against the copy now enrolled and, being exact duplicates, would return a matching score of 1.

[4] The Cognitec software does not provide a reason for specific failures to match. If a search returns no results above the matching threshold it will be marked as having failed; there were examples of this on Day 1, as the matching threshold had not been lowered as requested. Others may have failed for reasons such as very poor image quality, or because the eyes could not be located by the software (e.g. due to the subject wearing glasses). One search failed to generate a transaction ID and was therefore excluded from Stage 2.

[Chart: Distribution of Image Quality; number of images by quality score, 0 to 10]


Total number of searches launched: 212
Number of searches not returning any result for Stage 2 [5]: 15
Number of searches returning any results for Stage 2: 196
Number of searches returning a match in Stage 2 with a score of less than 1: 21
Number of searches returning a match in Stage 2 with a score of 1: 175

The 21 searches returning a score of less than the expected '1' were manually examined to identify possible reasons. It was discovered that all of those with a match score greater than 0.99 (a further 10 images) had nonetheless matched against themselves and were thus genuine duplicate images despite not scoring '1'.

The remaining 11 images were confirmed as being genuine non-matches (their scores ranged from 0.67 up to 0.85). It may be that they had not been enrolled in the FR database as anticipated, even though they were all of reasonable quality and had all previously been accepted by the system as probes. However, it has not been possible to verify this.

7.4 Results of the Automated Analysis

A key objective of this work was to assess the feasibility of developing a largely automated method for

evaluating search accuracy on the live PND system.

The following sections describe the approach. Note that only 1st position respondents were considered

(based on the highest scoring image that was returned for each search).

Using the Probe Image ID as a common key between the two sets of data, the results of the Stage 1 and Stage 2 searches were combined. This resulted in a total of 189 ‘usable’ searches (i.e. searches returning results for both Stage 1 and Stage 2, where the results could be cross-compared).

For those cases where a duplicate copy of the image had been found, it was assumed that both of the following would be true:

- Stage 1 comparison score = 1 (i.e. the probe had matched against a duplicate image)
- Stage 2 comparison score = 1 (i.e. the probe had matched against itself or the previous duplicate image)

If both of these conditions were true then the probe image was considered to be a confirmed duplicate. 68 images returned a score of 1 in Stage 1, while 175 did so in Stage 2. However, only 49 images returned a score of 1 in both Stages 1 and 2.
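The cross-comparison step described above can be sketched in a few lines of Python. This is an illustrative reconstruction only: the field names and the toy scores below are invented, and the real analysis was performed on spreadsheet data keyed on the Probe Image ID.

```python
# Hypothetical sketch of the Stage 1 / Stage 2 cross-comparison:
# join the two result sets on the probe image ID and flag as
# "confirmed duplicates" those probes scoring exactly 1 in BOTH stages.

def confirmed_duplicates(stage1, stage2):
    """Return probe IDs whose top score was 1 in both stages."""
    usable = stage1.keys() & stage2.keys()  # searches returning results in both stages
    return sorted(pid for pid in usable
                  if stage1[pid] == 1 and stage2[pid] == 1)

# Toy data: probe ID -> top-ranked comparison score (values invented)
stage1 = {"1075": 0.61, "1090": 1.0, "3003": 1.0, "3000": 1.0}
stage2 = {"1075": 1.0, "1090": 1.0, "3003": 1.0, "2012": 1.0}

print(confirmed_duplicates(stage1, stage2))  # ['1090', '3003']
```

Probe 1075 illustrates the point made above: a Stage 2 score of 1 alone is not enough, since the Stage 1 score shows no duplicate existed before enrolment.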

For those cases where a ‘true’ match had been found, it was assumed that the following would be true:

- Stage 1 comparison score < 1 (i.e. the probe had not matched against a duplicate image)
- Stage 2 comparison score = 1 (i.e. the probe had matched against itself)

⁵ The expectation was that all images that returned results for Stage 1 would also return a result for Stage 2. It is not clear why this figure for Stage 2 is higher than it was for Stage 1.


- Highest scoring Match Group in Stage 1 is the highest scoring Match Group in Stage 2
- Image ID in the Match Group for Stage 1 is not the Image ID in the Match Group for Stage 2
- Image Count within the Match Group in Stage 2 is higher than the Image Count for that Match Group in Stage 1

If ALL of these conditions were met, the result was considered to be a true match, confirmed by the fact that the probe image (once enrolled) had been added to the same Match Group as that found in the Stage 1 search.

Analysis of the data revealed that only 20 searches met all of these criteria.

For those cases where there was no corresponding image in the database to be found, it was assumed that the following would be true:

- Stage 1 comparison score < 1 (i.e. the probe had not matched against a duplicate image) OR no result was returned (i.e. no image in the database scored above the matching threshold)
- Stage 2 comparison score = 1 (i.e. the probe had matched against itself)
- Highest scoring Match Group in Stage 1 is not the highest scoring Match Group in Stage 2 (i.e. the image had been added to a different Match Group to the one that returned the highest scoring image in Stage 1)

If ALL of these conditions were met it meant that the probe image had been put into a new Match Group, the implication being that no matching image existed in the database at the time of the first search.

Analysis of the data revealed that 101 searches met these criteria.
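Taken together, the three rule sets can be expressed as a single classifier. The sketch below is a hypothetical encoding: the record field names (s1_score, s2_group and so on) are invented, and a Stage 1 score of None stands in for "no result returned above the matching threshold".

```python
# Illustrative encoding of the three outcome rules described above.
# Fields: s1_/s2_ prefixes denote Stage 1 and Stage 2 values; "group",
# "image" and "count" refer to the winning Match Group, its top image
# ID and its image count. All names are assumptions, not PND fields.

def classify(rec):
    """Label one cross-compared search: duplicate, true match, or non-match."""
    s1, s2 = rec["s1_score"], rec["s2_score"]
    if s1 == 1 and s2 == 1:
        return "confirmed duplicate"
    if (s1 is not None and s1 < 1 and s2 == 1
            and rec["s1_group"] == rec["s2_group"]   # same Match Group won both stages
            and rec["s1_image"] != rec["s2_image"]   # but a different top image
            and rec["s2_count"] > rec["s1_count"]):  # group grew when probe enrolled
        return "true match"
    if ((s1 is None or s1 < 1) and s2 == 1
            and rec["s1_group"] != rec["s2_group"]): # probe started a new group
        return "non-match"
    return "unresolved"

example = {"s1_score": 0.85, "s2_score": 1, "s1_group": "A", "s2_group": "A",
           "s1_image": "img1", "s2_image": "img2", "s1_count": 3, "s2_count": 4}
print(classify(example))  # true match
```

The "unresolved" branch corresponds to the remaining images discussed below, where neither rule set held cleanly.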

For the remaining images there are several possibilities; some may have been duplicates where one or both searches had not returned a score of 1 as expected. In other cases it may be that the probe image was added to a different Match Group to the one it (correctly) matched against the first time round. This can happen due to insufficient or inaccurate metadata associated with the image.

Note that there were also 10 cases where the comparison score for Stage 2 was lower than that seen in Stage 1. It is not clear how this could happen since, even if the image had failed to enrol (thereby explaining why the score in Stage 2 was not ‘1’), it should still have matched against the same image that was found the first time round, and with the same score. One possibility is that the image returning the original match had since been removed from the database, but there is no easy way to determine if this was actually the case.

Examples of the data returned from both the Stage 1 and Stage 2 searches are provided in Annex B.

7.5 Manual Analysis of Results

In order to validate the results above, it was acknowledged that it would be necessary to manually verify the results of at least a proportion of the searches.

During the test process, and as the results were reviewed, it became clear that some of the assumptions on which this evaluation was based were not valid and that the figures given above were therefore questionable; some of the duplicate images were scoring less than ‘1’, and in many cases matching images had been added to different Match Groups from the ones expected and as a result were not being recorded by the automated analysis.

The results of all of the Stage 1 searches were therefore manually examined to try to determine visually whether or not they had returned a true match.


It is important to note that human comparison of unfamiliar faces is difficult, even for experts trained in this field and working with high quality, high resolution images. In some cases, although the actual images were different, the subject’s clothing and the custody environment were clearly the same, and making a decision was relatively straightforward. In others, multiple images of the potentially matching subject were returned and these were used to help confirm the decision, but inevitably the results in this section are a subjective assessment and must be treated with caution.

Of the total of 211 original probe (enquiry) images:

- 70 probe images were determined to have matched against exact duplicates in the database before addition of the probe image in Stage 2, regardless of whether or not they were eventually both included in the same Match Group;
- 56 probe searches were determined to be true matches to other, pre-existing, images of faces in the database, regardless of whether or not they were eventually added to the same Match Group;
- 43 probe images were determined to be genuine non-matches (i.e. a new Match Group had been created for them and they did not appear in any of the other Match Groups returned by the search).

For the remaining images it was impossible to reach a conclusion.

In summary, out of the initial 211 searches, the automated facial search of PND identified just 20 true matches, whereas visual examination by the tester identified a total of 56 matches. For declared non-matches, the visual examination identified 43 images, whereas the automated facial search of the database identified 101.

7.6 Interpretation of Results

The large discrepancies between the automated and manual analysis indicate that this methodology is not suitable for extension to larger tests of the automated facial search. In the main, this can be attributed to shortcomings in the custody data which is uploaded to PND by police forces, not just in terms of image quality but in the quality and quantity of the associated metadata accompanying the images. Evidence for this is provided by the following:

Duplicate Images

Over 40% of all searches launched found at least one (and often many more) duplicate images already in the database. PND is designed to accept data from forces ‘as is’ and thus does not carry out any checks that might highlight such duplication prior to enrolment in the database.

This issue of duplicate images is not a new problem. FR evaluations undertaken by PITO⁶ in 2006 using data from the FIND⁷ pilot indicated that around 60% of those images were present more than once. With such a small sample size it is impossible to know if the 40% figure found here is truly representative, but it should be noted that for a database of over 13 million images this would equate to around 5M images that are held in the database unnecessarily, with all of the associated storage and licence costs. As many of the duplicates appear multiple times, there is reason to believe the true figure may actually be much higher.

⁶ Police IT Organisation – the precursor to the National Policing Improvement Agency
⁷ Facial Images National Database – a PITO project to establish a national custody image database, which resulted in a pilot system but was subsequently cancelled by the NPIA in 2008

Images assigned to multiple Match Groups

On receipt of custody records from individual forces, PND assigns them into Match Groups. If there is sufficient demographic data to link a record to an existing one, it will be assigned to that Match Group; otherwise a new group will be created. As with duplicate images, shortcomings in the custody data uploaded to PND by forces often result in insufficient metadata to link the records of an individual, with the result that a single individual appears in multiple Match Groups.

The conclusion from this study is that the use of Match Groups by PND makes analysis of the data from tests like these extremely difficult.

Comparison Scores and Matching Thresholds

Despite concerns over the statistical validity of the results from such a small sample size, it has been possible to plot comparison score distributions for the manually confirmed ‘matching’ and ‘non-matching’ searches.

In an ‘ideal’ biometric system, the two distributions would be well separated, allowing a threshold to be set that returned all of the ‘true’ matches but no ‘false’ matches. In practice this is rarely, if ever, observed.

The original aim of this test was to pave the way for a much larger scale evaluation using the same methodology, thereby producing more statistically valid results. In the light of the results reported in this paper, it now seems that this approach is inadequate.

Nevertheless, even with the limited data available, the existence of two separate distributions is apparent, with most non-matching images scoring no higher than 0.8, while the overwhelming majority of true matches scored in excess of 0.95. These very high scores may be attributed to the fact that custody images were used as the probes and, despite concerns around image quality, were matching against images from previous arrests taken under the same conditions and often in the same custody suite. Note also that using non-custody images taken with different camera equipment and in different environments would produce very different (and generally lower) comparison scores, and the differences between matches and non-matches would not be so clear.
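The relationship between these two distributions and the choice of threshold can be illustrated with a toy calculation. The sample scores below are invented for illustration; they merely mimic the separation described above (non-matches at or below 0.85, true matches at or above 0.92).

```python
# Sketch of how match / non-match score distributions translate into
# error rates at a given matching threshold. Scores are invented.

def error_rates(matches, non_matches, threshold):
    """Fraction of true matches missed, and of non-matches wrongly returned."""
    miss = sum(s < threshold for s in matches) / len(matches)
    false_hit = sum(s >= threshold for s in non_matches) / len(non_matches)
    return miss, false_hit

matches = [0.96, 0.98, 1.0, 0.99, 0.92]       # illustrative true-match scores
non_matches = [0.67, 0.71, 0.78, 0.80, 0.85]  # illustrative non-match scores

print(error_rates(matches, non_matches, 0.9))  # (0.0, 0.0)
```

With well-separated samples like these, a threshold of 0.9 produces no errors of either kind; with overlapping distributions (as would be expected for non-custody probes), any threshold trades one error type against the other.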

[Figure: Comparison Score distribution – number of images (0–60) by comparison score (0.5–1.0), plotted separately for True Matches and True Non-Matches]


8 Conclusions and Recommendations

8.1 General

As highlighted in the House of Commons Science and Technology Committee’s 6th report (Current and Future Uses of Biometric Data and Technologies):

‘When biometric systems are employed by the state in ways that impact upon citizens’ civil liberties, it is imperative that they are accurate and dependable. Rigorous testing and evaluation must therefore be undertaken prior to, and after, deployment, and details of performance levels published.’

The evaluation approach detailed in this report was developed as a possible way of mitigating the fact that no formal accuracy assessment had been carried out prior to operational deployment of the facial search capability on PND.

In terms of the original evaluation objectives, the small sample size and lack of any ground truth for the images on PND mean that it has not been possible to draw any firm conclusions with regard to search accuracy and, given the very ‘noisy’ nature of the data, even large scale tests using this approach seem unlikely to succeed. Nonetheless, in the absence of any additional funding or contractual obligation on the Systems Integrator to carry out such performance testing, this work does still provide insights into the way facial images are currently stored on PND and their suitability for use with an automated facial recognition algorithm as currently implemented on PND.

RECOMMENDATION 1. It is essential that performance assessment of biometric systems be undertaken as part of the implementation process, prior to operational deployment. A ‘lessons learned’ exercise should be undertaken to understand why such formal testing was not included within the scope of this particular deployment.

RECOMMENDATION 2. Accuracy assurance should be embedded as an essential requirement in the development and implementation of future systems which include a biometric element. Requirements governing this must be adequately covered within the contractual arrangements with the supplier(s).

OBSERVATION 1. The benefits of undertaking a full scale evaluation on the current system at this stage in the contract would most likely be outweighed by the cost, as no suitable test environment currently exists. However, future enhancements to the system may provide opportunities whereby such testing could be accommodated for relatively little additional expense. For example, if a second (backup) FR capability for PND were to be procured to provide business continuity, it could potentially also be used for accuracy evaluation and other testing purposes, provided the requirements for such activities are identified at the design stage.

8.2 Image Quality

The quality of custody images being uploaded to PND by forces is a cause for concern, with none of the images used in this test fully meeting the 2008 police custody image standard (which itself was only ever intended to be a minimum standard).

While PND accepts data ‘as is’ from forces, it is understood that steps have now been taken to exclude images smaller than 10KB from enrolment into the FR database and to recommend an optimum image file size of 50KB to forces. However, these measures alone are not sufficient to guarantee image quality as they do not take into account other aspects of the quality of the image (e.g. what percentage of the image the head actually occupies).
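A quality check of the kind described could be sketched as follows. The 10KB minimum reflects the step the report says has now been taken; the head-area check and its threshold are purely illustrative assumptions, standing in for the richer checks a dedicated face image quality tool would apply.

```python
# Hypothetical enrolment pre-filter. MIN_FILE_KB reflects the 10KB
# exclusion described in the text; MIN_HEAD_FRACTION is an invented
# stand-in for a check that file size alone cannot capture.

MIN_FILE_KB = 10          # images below this are excluded from FR enrolment
MIN_HEAD_FRACTION = 0.1   # assumed: head must occupy at least 10% of the frame

def accept_for_enrolment(file_kb, head_fraction):
    """Return (accepted, reason) for a candidate custody image."""
    if file_kb < MIN_FILE_KB:
        return False, "file too small for reliable FR enrolment"
    if head_fraction < MIN_HEAD_FRACTION:
        return False, "head occupies too little of the image"
    return True, "accepted"

print(accept_for_enrolment(8, 0.3))    # rejected: file too small
print(accept_for_enrolment(55, 0.25))  # accepted
```

The second call shows why the file-size rule alone is insufficient: a 55KB image passes that test regardless of how little of the frame the face occupies, which is exactly the gap Recommendation 3 is aimed at.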


Hence it is recommended that:

RECOMMENDATION 3. Consideration should be given to implementing a dedicated face image quality software check as part of the FR enrolment process on PND, with parameters chosen⁸ to exclude those images that are unsuitable for use with an FR algorithm.

RECOMMENDATION 4. Periodic feedback should be provided to forces on the percentage of images which are rejected, along with the reasons for rejection, in order to encourage the capture of better quality images in the future.

RECOMMENDATION 5. By the time an image is uploaded to PND it is too late to address any shortcomings in quality. In the longer term, therefore, the use of image quality software at the point of capture is recommended (similar to that used on Livescan units for fingerprints). This would require the facial image of the subject to be re-captured if it failed to meet certain quality criteria considered necessary for effective use with facial recognition systems.

OBSERVATION 2. While not directly relevant to this particular evaluation, it should also be noted that internationally there is a trend towards taking much higher resolution custody images, as well as taking images at multiple pose angles, to support forensic facial image comparison as well as automated facial image matching.

8.3 Comparison Thresholds and Candidate List Length

The selection of appropriate matching thresholds is very dependent on the type of images being used, both those in the database and the probe images. This particular evaluation only used custody images as probes, and the results suggest that a very high threshold setting would yield acceptable results in such cases. However, this type of image is not typical of those used in the majority of operational cases. In general, the greater the difference in image quality between the probe image and those in the database, the lower the threshold scores for true matches are likely to be.

Prior to these tests the supplier’s default setting of 0.7 was in use, but a decision has now been taken by the National User Group (NUG) to lower this to 0.65. This has the effect of increasing the number of candidate images returned from the database for the investigating officer to review manually, but increases the likelihood of a true match being returned in the list (if one actually exists in the database).

Balancing threshold settings and the maximum number of respondents returned, in order to maximise the operational benefit that the FR (or any other biometric) capability delivers, is not a trivial exercise and would require a far more extensive test than the one being reported on here.

RECOMMENDATION 6. The data from this evaluation is insufficient to draw any firm conclusions about what the correct threshold setting should be. A formal process should be introduced to monitor the impact of the recent change in threshold from 0.7 to 0.65, and feedback sought from the user community as to whether the lower value has led to an increase in the number of matches being found.
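The trade-off behind the NUG’s change can be sketched numerically. The scores below are invented; the point is only that lowering the threshold from 0.7 to 0.65 lengthens the candidate list an officer must review, while potentially pulling a borderline true match into it.

```python
# Sketch of candidate-list growth when the matching threshold is lowered.
# Database scores are invented; max_returned caps the respondent list.

def candidates(scores, threshold, max_returned=10):
    """Images at or above the threshold, best first, capped at max_returned."""
    return sorted((s for s in scores if s >= threshold), reverse=True)[:max_returned]

scores = [0.66, 0.68, 0.72, 0.81, 0.64]  # illustrative comparison scores

print(len(candidates(scores, 0.70)))  # 2 candidates at the old default
print(len(candidates(scores, 0.65)))  # 4 at the lowered threshold
```

If the image scoring 0.68 were in fact a true match, only the lower threshold would surface it; the cost is the extra non-matching candidates the officer must now examine, which is why Recommendation 6 calls for the change to be monitored.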

⁸ Note that optimisation of face image quality software requires significant testing to ensure that the parameters chosen are appropriate for a particular application. The large variations seen in image size and quality would make this particularly challenging in the case of the police custody images currently being uploaded to PND, balancing the need to exclude those that provide no value (and may actually degrade overall performance) against the benefits of having as many images as possible in the database.


8.4 Multiple Match Groups and Duplicate Images

Although the test sample size was small, this evaluation suggests that there is a high proportion of duplicate images on PND, both within a single Match Group and across multiple Match Groups. In many biometric systems, de-duplication of the data is routinely undertaken to improve data integrity and to reduce the total number of images / templates that need to be stored, thereby reducing both storage and licence costs.

These tests suggest that there is the potential to reduce the total number of images stored on PND (and thus templates in the FR database) by around a third, and possibly much more. The cost of extending the Cognitec licence beyond 10M templates is understood to have been £120k, plus an ongoing charge of £18k per year thereafter; this would not have been necessary had de-duplication been built into the enrolment process or, preferably, addressed at a local force level prior to uploading images onto PND.
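A force-level de-duplication step of the kind suggested could, at its simplest, drop byte-identical copies before upload. The sketch below is hypothetical and handles exact duplicates only; near-duplicate detection (e.g. re-encoded copies of the same photograph) would require comparison scores from an FR engine rather than file hashes.

```python
# Illustrative local de-duplication: keep one copy of each
# byte-identical image by hashing file contents before upload.
import hashlib

def dedupe(images):
    """Return the input images with byte-identical repeats removed."""
    seen, kept = set(), []
    for img in images:
        digest = hashlib.sha256(img).hexdigest()  # content fingerprint
        if digest not in seen:
            seen.add(digest)
            kept.append(img)
    return kept

imgs = [b"image-A", b"image-B", b"image-A", b"image-A"]
print(len(dedupe(imgs)))  # 2
```

Applied at around 40% duplication, even this crude exact-match step would have materially reduced the template count driving the licence costs described above.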

The large number of multiple Match Groups associated with duplicate images of a single individual is largely due to the poor quality or lack of metadata associated with the images, preventing consolidation into a single Match Group. While data quality will always be a challenge on a system designed to store intelligence data, there is no technical reason why this should be the case with custody images, which are taken in a controlled environment alongside fingerprints and DNA.

The police image standard developed by PITO / NPIA provided some information on the metadata that should be provided with the images, and the NPIA also developed a Corporate Data Model to improve consistency and interoperability of police data. However, take-up of both of these with respect to custody images has been limited to date.

RECOMMENDATION 7. Police forces need to be made aware of the importance of properly managing their local custody image collections and capture processes if the benefits of having a national collection are to be fully realised.

RECOMMENDATION 8. Police forces should be made aware of the importance of providing sufficient and accurate metadata with the images, so that records relating to the same subject can be properly linked. In the longer term, the development and implementation of common data formats should be mandated.

RECOMMENDATION 9. Some of the issues observed with duplicate images and inconsistent metadata are a result of the fragmented landscape in forces for custody imaging systems and processes. Alternative approaches to the storage and management of custody image data should be explored. Better integration between local custody imaging systems and other national systems (e.g. IDENT1 and PNC) would be one means of improving data integrity. Alternatively, a centralised solution such as is currently in place for fingerprints and DNA data would improve data quality and integrity and should, in the longer term, be more cost effective than continuing to maintain separate systems in each force.


Annex A – An example of the probe image data spreadsheet

Enquiry Image Details (columns recorded for each probe image):

- UniqueID / Test Image number
- Subject details: M/F; Ethnicity; Age
- Image details: Compliant pose; Glasses; Beard / Moustache; Other comments; File size (kB); Image dimensions; I2I distance; Overall quality (1-10)


Annex B – An example of the results analysis spreadsheet

Enquiry Image_ID | Score Part 2 | Match Group Part 2 | Image ID Part 2 | Score Part 1 | Match Group Part 1 | Image ID Part 1
1075 | 1 | 96534931561 | 57D1FDD5C44C404EBA2BEB1F90D8FCC3 | 0.60667384 | 349464132480 | F2E014A7F26442538658D5C891478A2C
1085 | 1 | 835941528399 | 87F522589B634E2B8D272182D631A53E | 0.7349086 | 74831202655 | 94FB87D83306495D930B52DAD221C06D
1090 | 1 | 3748440744 | 1B0AF15831BE4B7C954B005BCCBA5C15 | 1 | 3748440744 | 1B0AF15831BE4B7C954B005BCCBA5C15
1110 | 1 | 108686818734 | 67DEF9451429415D8A5815B641A339FB | 0.9995563 | 108686818734 | 5454E7AF7BAE45EDA8D8681ACDBC4E99
3005 | 1 | 65566390764 | F40C85415E584B18A932BA3F9B79073A | 1 | 65566390764 | 775558CAFAC84DD592E4E895819319FE
2012 | 1 | 613539372441 | A9292FBDF54A4285967EB6AE916435EE | 0.6472585 | 50152545500 | D06C224236A641F1B53E5F964E98922B
3000 | 0.72495764 | 421996099964 | 8670F9F5D87F416E8F24DF2598BE6F3F | 1 | 738700910263 | 09352D5150E947C3BBC4C1E8F2B31E9A
3003 | 1 | 243620570040 | 121BF5DFAD8148C0BC83CCBD77D85517 | 1 | 243620570040 | 121BF5DFAD8148C0BC83CCBD77D85517
3005 | 1 | 65566390764 | F40C85415E584B18A932BA3F9B79073A | 0.8936661 | 65566390764 | 775558CAFAC84DD592E4E895819319FE
3006 | 0.9999833 | 718376855844 | C321E7803CC04604B0904E9293979712 | 0.9999833 | 718376855844 | 4F639197E4A64BDA9E97BA2A3EF4336B
