      BING:Binarized normed gradients for objectness estimation at 300fps

Computational Visual Media, 2019, Issue 1

Ming-Ming Cheng, Yun Liu, Wen-Yan Lin, Ziming Zhang, Paul L. Rosin, and Philip H. S. Torr

Abstract Training a generic objectness measure to produce object proposals has recently become of significant interest. We observe that generic objects with well-defined closed boundaries can be detected by looking at the norm of gradients, with a suitable resizing of their corresponding image windows to a small fixed size. Based on this observation and for computational reasons, we propose to resize the window to 8×8 and use the norm of the gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g., add, bitwise shift, etc.). To improve the localization quality of the proposals while maintaining efficiency, we propose a novel fast segmentation method and demonstrate its effectiveness for improving BING's localization performance, when used in multi-thresholding straddling expansion (MTSE) postprocessing. On the challenging PASCAL VOC2007 dataset, using 1000 proposals per image and an intersection-over-union threshold of 0.5, our proposal method achieves a 95.6% object detection rate and 78.6% mean average best overlap in less than 0.005 second per image.

Keywords object proposals; objectness; visual attention; category agnostic proposals

      1 Introduction

As suggested in pioneering research [1, 2], objectness is usually taken to mean a value which reflects how likely an image window covers an object of any category. A generic objectness measure has great potential to be used as a pre-filter for many vision tasks, including object detection [3–5], visual tracking [6, 7], object discovery [8, 9], semantic segmentation [10, 11], content aware image retargeting [12], and action recognition [13]. Especially for object detection, proposal-based detectors have dominated recent state-of-the-art performance. Compared with sliding windows, objectness measures can significantly improve computational efficiency by reducing the search space, and system accuracy by allowing the use of complex subsequent processing during testing. However, designing a good generic objectness measure is difficult; the method should:

• achieve a high object detection rate (DR), as any undetected objects rejected at this stage cannot be recovered later;

• possess high proposal localization accuracy, measured by average best overlap (ABO) for each object in each class and mean average best overlap (MABO) across all classes;

• be highly computationally efficient so that it is useful in realtime and large-scale applications;

• produce a small number of proposals, to reduce the amount of subsequent processing;

• possess good generalization to unseen object categories, so that the proposals can be used in various vision tasks without category biases.

To the best of our knowledge, no prior method can satisfy all of these ambitious goals simultaneously. Research from cognitive psychology [14, 15] and neurobiology [16, 17] suggests that humans have a strong ability to perceive objects before identifying them. Based on the observed human reaction time and the biologically estimated signal transmission time, human attention theories hypothesize that the human visual system processes only parts of an image in detail, while leaving others nearly unprocessed. This further suggests that before identifying objects, simple mechanisms in the human visual system select possible object locations.

In this paper, we propose a surprisingly simple and powerful feature, which we call "BING", to help search for objects using objectness scores. Our work is motivated by the concept that objects are stand-alone things with well-defined closed boundaries and centers [2, 18, 19], even if the visibility of these boundaries depends on the characteristics of the background and of occluding foreground objects. We observe that generic objects with well-defined closed boundaries share surprisingly strong correlation in terms of the norm of their gradients (see Fig. 1 and Section 3), after resizing their corresponding image windows to a small fixed size (e.g., 8×8). Therefore, in order to efficiently quantify the objectness of an image window, we resize it to 8×8 and use the norm of the gradients as a simple 64D feature for learning a generic objectness measure in a cascaded SVM framework. We further show how the binarized version of the norm of gradients feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation of image windows, using only a few atomic CPU operations (add, bitwise shift, etc.). The BING feature's simplicity, while using advanced speed-up techniques to make the computational time tractable, contrasts with recent state-of-the-art techniques [2, 20, 21] which seek increasingly sophisticated features to obtain greater discrimination.

Fig. 1 Although object (red) and non-object (green) windows vary greatly in image space (a), at proper scales and aspect ratios which correspond to a small fixed size (b), their corresponding normed gradients (NG features) (c) share strong correlation. We learn a single 64D linear model (d) for selecting object proposals based on their NG features.

The original conference presentation of BING [22] has received much attention. Its efficiency and high detection rates make BING a good choice in a large number of successful applications that require category independent object proposals [23–29]. Recently, deep neural network based object proposal generation methods have become very popular due to their high recall and computational efficiency, e.g., RPN [30], YOLO9000 [31], and SSD [32]. However, these methods generalize poorly to unseen categories, and rely on training with many ground-truth annotations for the target classes. For instance, the detected object proposals of RPN are highly related to the training data: after training it on the PASCAL VOC dataset [33], the trained model will aim to detect only the 20 classes of objects therein and performs poorly on other datasets like MS COCO (see Section 5.4). Its poor generalization ability has restricted its usage, so RPN is usually only used in object detection. In comparison, BING is based on low-level cues concerning enclosing boundaries and thus can produce category independent object proposals, which has demonstrated applications in multi-label image classification [23], semantic segmentation [25], video classification [24], co-salient object detection [29], deep multi-instance learning [26], and video summarisation [27]. However, several researchers [34–37] have noted that BING's proposal localization is weak.

This manuscript further improves proposal localization over the method described in the conference version [22] by applying multi-thresholding straddling expansion (MTSE) [38] as a postprocessing step. Standard MTSE would introduce a significant computational bottleneck because of its image segmentation step. Therefore we propose a novel image segmentation method, which generates accurate segments much more efficiently. Our approach starts with a GPU version of the SLIC method [39, 40] to quickly obtain initial seed regions (superpixels) by performing oversegmentation. Region merging is then performed based on average pixel distances. We replace the method from Ref. [41] in MTSE with this novel grouping method [42], and dub the new proposal system BING-E.

We have extensively evaluated our objectness methods on the PASCAL VOC2007 [33] and Microsoft COCO [43] datasets. The experimental results show that our method efficiently (at 300 fps for BING and 200 fps for BING-E) generates a small set of data-driven, category-independent, and high-quality object windows. BING is able to achieve a 96.2% detection rate (DR) with 1000 windows and an intersection-over-union (IoU) threshold of 0.5. At the increased IoU threshold of 0.7, BING-E can obtain 81.4% DR and 78.6% mean average best overlap (MABO). Feeding the proposals to the fast R-CNN framework [4] for an object detection task, BING-E achieves 67.4% mean average precision (MAP). Following Refs. [2, 20, 21], we also verify the generalization ability of our method. When training our objectness measure on the VOC2007 training set and testing on the challenging COCO validation set, our method still achieves competitive performance. Compared to most popular alternatives [2, 20, 21, 34, 36, 44–50], our method achieves competitive performance using a smaller set of proposals, while being 100–1000 times faster than them. Thus, our proposed method achieves significantly higher efficiency while providing state-of-the-art generic object proposals. This performance fulfils a key previously stated requirement for a good objectness detector. Our source code is published with the paper.

      2 Related works

Being able to perceive objects before identifying them is closely related to bottom-up visual attention (saliency). According to how saliency is defined, we broadly classify related research into three categories: fixation prediction, salient object detection, and objectness proposal generation.

      2.1 Fixation prediction

Fixation prediction models aim to predict human eye movements [51, 52]. Inspired by neurobiological research on early primate visual systems, Itti et al. [53] proposed one of the first computational models for saliency detection, which estimates center-surround differences across multi-scale image features. Ma and Zhang [54] proposed a fuzzy growing model to analyze local contrast based saliency. Harel et al. [55] proposed normalizing center-surrounded feature maps for highlighting conspicuous parts. Although fixation point prediction models have developed remarkably, the prediction results tend to highlight edges and corners rather than entire objects. Thus, these models are unsuitable for generating generic object proposals.

      2.2 Salient object detection

Salient object detection models try to detect the most attention-grabbing objects in a scene, and then segment the whole extent of those objects [56–58]. Liu et al. [59] combined local, regional, and global saliency measurements in a CRF framework. Achanta et al. [60] localized salient regions using a frequency-tuned approach. Cheng et al. [61] proposed a salient object detection and segmentation method based on region contrast analysis and iterative graph based segmentation. More recent research has also tried to produce high-quality saliency maps in a filtering-based framework [62]. Such salient object segmentation has achieved great success for simple images in image scene analysis [63–65] and content aware image editing [66, 67]; it can be used as a cheap tool to process a large number of Internet images or to build robust applications [68–73] by automatically selecting good results [61, 74]. However, these approaches are less likely to work for complicated images in which many objects are present but are rarely dominant (e.g., PASCAL VOC images).

2.3 Objectness proposal generation

These methods avoid making decisions early on, by proposing a small number (e.g., 1000) of category-independent proposals that are expected to cover all objects in an image [2, 20, 21]. Producing rough segmentations [21, 75] as object proposals has been shown to be an effective way of reducing search spaces for category-specific classifiers, whilst allowing the usage of strong classifiers to improve accuracy. However, such methods [21, 75] are very computationally expensive. Alexe et al. [2] proposed a cue integration approach to get better prediction performance more efficiently. Broadly speaking, two main categories of object proposal generation methods exist: region based methods and edge based methods.

Region based object proposal generation methods mainly look for sets of regions produced by image segmentation and use the bounding boxes of these sets of regions to generate object proposals. Since image segmentation aims to cluster pixels into regions expected to represent objects or object parts, merging certain regions is likely to find complete objects. A large literature has focused on this approach. Uijlings et al. [20] proposed a selective search approach, which combined the strengths of both an exhaustive search and segmentation, to achieve higher prediction performance. Pont-Tuset et al. [36] proposed a multiscale method to generate segmentation hierarchies, and then explored the combinatorial space of these hierarchical regions to produce high-quality object proposals. Other well-known algorithms [21, 45–47, 49] fall into this category as well.

Edge based object proposal generation approaches use edges to explore where in an image complete objects occur. As pointed out in Ref. [2], complete objects usually have well-defined closed boundaries in space, and various methods have achieved high performance using this intuitive cue. Zitnick and Dollár [34] proposed a simple box objectness score that measured the number of contours wholly enclosed by a bounding box, generating object bounding box proposals directly from edges in an efficient way. Lu et al. [76] proposed a closed contour measure defined by a closed path integral. Zhang et al. [44] proposed a cascaded ranking SVM approach with an oriented gradient feature for efficient proposal generation.

Generic object proposals are widely used in object detection [3–5], visual tracking [6, 7], video classification [24], pedestrian detection [28], content aware image retargeting [12], and action recognition [13]. Thus a generic objectness measure can benefit many vision tasks. In this paper, we describe a simple and intuitive object proposal generation method which generally achieves state-of-the-art detection performance, and is 100–1000 times faster than most popular alternatives [2, 20, 21] (see Section 5).

      3 BING for objectness measure

      3.1 Preliminaries

Inspired by the ability of the human visual system to efficiently perceive objects before identifying them [14–17], we introduce a simple 64D norm-of-gradients (NG) feature (Section 3.2), as well as its binary approximation, i.e., the binarized normed gradients (BING) feature (Section 3.4), for efficiently capturing the objectness of an image window.

To find generic objects within an image, we scan over a predefined set of quantized window sizes (scales and aspect ratios)①. Each window is scored with a linear model w ∈ R⁶⁴ (Section 3.3):

    s_l = ⟨w, g_l⟩    (1)
    l = (i, x, y)    (2)

where s_l, g_l, l, i, and (x, y) are the filter score, NG feature, location, size, and position of a window, respectively. Using non-maximal suppression (NMS), we select a small set of proposals from each size i. Zhao et al. [37] showed that this choice of window sizes along with the NMS is close to optimal. Some sizes (e.g., 10×500) are less likely than others (e.g., 100×100) to contain an object instance. Thus we define the objectness score (i.e., the calibrated filter score) as

    o_l = v_i · s_l + t_i    (3)

where v_i, t_i ∈ R are learnt coefficient and bias terms for each quantized size i (Section 3.3). Note that calibration using Eq. (3), although very fast, is only required when re-ranking the small set of final proposals.

① In all experiments, we test 36 quantized target window sizes {(W_o, H_o)}, where W_o, H_o ∈ {16, 32, 64, 128, 256, 512}. We resize the input image to 36 sizes so that 8×8 windows in the downsized images (from which we extract features) correspond to target windows.
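To make the scoring in Eqs. (1)–(3) concrete, here is a minimal C++ sketch that scores one candidate window from its 64D NG feature and then calibrates the score for its quantized size. The function and variable names (filterScore, objectnessScore, vi, ti) are our own illustration, not the released implementation.

```cpp
#include <array>
#include <cstdint>

// Filter score (Eq. 1): inner product of the learned model w
// with the 64D NG feature of a window.
float filterScore(const std::array<float, 64>& w,
                  const std::array<uint8_t, 64>& ngFeature) {
    float s = 0.f;
    for (int i = 0; i < 64; ++i)
        s += w[i] * ngFeature[i];
    return s;
}

// Objectness score (Eq. 3): per-size linear calibration, where vi and
// ti are the coefficient and bias learnt for quantized size i.
float objectnessScore(float filterScore, float vi, float ti) {
    return vi * filterScore + ti;
}
```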

      3.2 Normed gradients(NG)and objectness

Objects are stand-alone things with well-defined closed boundaries and centers [2, 18, 19], although the visibility of these boundaries depends on the characteristics of the background and occluding foreground objects. When resizing windows corresponding to real world objects to a small fixed size (e.g., 8×8, chosen for computational reasons that will be explained in Section 3.4), the norms (i.e., magnitudes) of the corresponding image gradients become good discriminative features, because of the limited variation that closed boundaries could present in such an abstracted view. As demonstrated in Fig. 1, although the cruise ship and the person have huge differences in terms of color, shape, texture, illumination, etc., they share clear similarity in normed gradient space. To utilize this observation to efficiently predict the existence of object instances, we first resize the input image to different quantized sizes and calculate the normed gradients of each resized image. The values in an 8×8 region of these resized normed gradient maps are defined as a 64D vector of normed gradients (NG)① feature of its corresponding window.

① The normed gradient is the Euclidean norm of the gradient.

Our NG feature, as a dense and compact objectness feature for an image window, has several advantages. Firstly, no matter how an object changes its position, scale, and aspect ratio, its corresponding NG feature will remain roughly unchanged, because the region for computing the feature is normalized. In other words, NG features are insensitive to changes of translation, scale, and aspect ratio, which is very useful for detecting objects of arbitrary categories. Such insensitivity is a property that a good objectness proposal generation method should have. Secondly, the dense compact representation of the NG feature allows it to be calculated and verified very efficiently, giving it great potential for realtime applications.

The cost of introducing such advantages to the NG feature is loss of discriminative ability. However, this is not a problem, as BING can be used as a pre-filter, and the resulting false positives can be processed and eliminated by subsequent category specific detectors. In Section 5, we show that our method results in a small set of high-quality proposals that cover 96.2% of the true object windows in the challenging VOC2007 dataset.

      3.3 Learning objectness measurement with NG

To learn an objectness measure for image windows, we follow the two-stage cascaded SVM approach [44].

Stage I. We learn a single model w for Eq. (1) using a linear SVM [77]. NG features of ground truth object windows and randomly sampled background windows are used as positive and negative training samples respectively.

Stage II. To learn v_i and t_i in Eq. (3) using a linear SVM [77], we evaluate Eq. (1) at size i for training images and use the selected (NMS) proposals as training samples, their filter scores as 1D features, and check their labeling using training image annotations (see Section 5 for evaluation criteria). As can be seen in Fig. 1(d), the learned linear model w (see Section 5 for experimental settings) looks similar to the multi-size center-surrounded patterns [53] hypothesized as a biologically plausible architecture in primates [15, 16, 78]. The large weights along the borders of w favor a boundary that separates an object (center) from its background (surround). Compared to manually designed center-surround patterns [53], our learned w captures a more sophisticated natural prior. For example, lower object regions are more often occluded than upper parts. This is represented by w placing less confidence in the lower regions.

      3.4 Binarized normed gradients(BING)

To make use of recent advances in binary model approximation [79, 80], we describe an accelerated version of the NG feature, namely binarized normed gradients (BING), to speed up the feature extraction and testing process. Our learned linear model w ∈ R⁶⁴ can be approximated by a set of basis vectors

    w ≈ Σ_{j=1}^{N_w} β_j a_j    (4)

using Algorithm 1, where N_w denotes the number of basis vectors, a_j ∈ {−1, 1}⁶⁴ denotes a single basis vector, and β_j ∈ R denotes its corresponding coefficient. By further representing each a_j using a binary vector and its complement, a_j = a_j⁺ − ā_j⁺, where a_j⁺ ∈ {0, 1}⁶⁴, a binarized feature b can be tested using fast bitwise AND and bit count operations (see Ref. [79]):

    ⟨w, b⟩ ≈ Σ_{j=1}^{N_w} β_j (2⟨a_j⁺, b⟩ − |b|)    (5)
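As an illustration of Eq. (5), the sketch below evaluates the approximate inner product with a binary feature packed into a 64-bit word: one bitwise AND and two population counts per basis vector. This is our own rendering (names are assumptions), and it assumes C++20 for std::popcount.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <vector>

// Approximate <w, b> (Eq. 5) for a binary feature b, given the positive
// parts aPlus[j] of the basis vectors and coefficients beta[j]. Since
// a_j = aPlus_j - complement(aPlus_j) over the 64 bit positions,
// <a_j, b> = 2*popcount(aPlus[j] & b) - popcount(b).
double approxDot(const std::vector<uint64_t>& aPlus,
                 const std::vector<double>& beta,
                 uint64_t b) {
    const int cntB = std::popcount(b);
    double s = 0.0;
    for (size_t j = 0; j < aPlus.size(); ++j)
        s += beta[j] * (2 * std::popcount(aPlus[j] & b) - cntB);
    return s;
}
```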

The key challenge is how to binarize and calculate NG features efficiently. We approximate the normed gradient values (each saved as a byte value) using the top N_g binary bits of the byte values. Thus, a 64D NG feature g_l can be approximated by N_g binarized normed gradients (BING) features as

    g_l ≈ Σ_{k=1}^{N_g} 2^{8−k} b_{k,l}    (6)

Algorithm 1 Binary approximate model w [79]
Input: w, N_w
Output: {β_j}_{j=1}^{N_w}, {a_j}_{j=1}^{N_w}
Initialize residual: ε = w
for j = 1 to N_w do
    a_j = sign(ε)
    β_j = ⟨a_j, ε⟩ / ‖a_j‖²    (project ε onto a_j)
    ε ← ε − β_j a_j    (update residual)
end for
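For concreteness, here is a compact C++ rendering of Algorithm 1 (our sketch, not the released code). Since every entry of a_j is ±1, ‖a_j‖² = 64 and ⟨sign(ε), ε⟩ is simply the sum of |ε_i|; each a_j is packed into a 64-bit word by its positive part, matching the representation used in Eq. (5).

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// Algorithm 1: greedily approximate w by Nw binary basis vectors a_j
// (entries +/-1, stored by their positive parts) and coefficients
// beta_j, by repeatedly projecting the residual onto sign(residual).
void approximateModel(const std::array<float, 64>& w, int Nw,
                      std::vector<uint64_t>& aPlus,
                      std::vector<float>& beta) {
    std::array<float, 64> eps = w;             // residual, initialized to w
    for (int j = 0; j < Nw; ++j) {
        uint64_t a = 0;
        float dot = 0.f;
        for (int i = 0; i < 64; ++i) {
            if (eps[i] >= 0) a |= 1ULL << i;   // a_j = sign(eps)
            dot += std::abs(eps[i]);           // <a_j, eps> for a_j = sign(eps)
        }
        const float b = dot / 64.f;            // beta_j = <a_j,eps>/||a_j||^2
        for (int i = 0; i < 64; ++i)           // eps <- eps - beta_j * a_j
            eps[i] -= ((a >> i) & 1) ? b : -b;
        aPlus.push_back(a);
        beta.push_back(b);
    }
}
```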

Notice that these BING features have different weights according to their corresponding bit position in the byte values.

Naively determining an 8×8 BING feature requires a loop accessing 64 positions. By exploiting two special characteristics of an 8×8 BING feature, we develop a fast BING feature calculation algorithm (Algorithm 2), which uses atomic updates (bitwise SHIFT and bitwise OR) to avoid computing the loop. Firstly, a BING feature b_{x,y} and its last row r_{x,y} are saved in a single int64 and a byte variable, respectively. Secondly, adjacent BING features and their rows have a simple cumulative relation. As shown in Fig. 2 and Algorithm 2, the bitwise SHIFT operator shifts r_{x−1,y} by one bit, automatically discarding the bit which does not belong to r_{x,y}, and makes room to insert the new bit b_{x,y} using the bitwise OR operator. Similarly, bitwise SHIFT shifts b_{x,y−1} by 8 bits, automatically discarding the bits which do not belong to b_{x,y}, and makes room to insert r_{x,y}.

Our efficient BING feature calculation shares its cumulative nature with the integral image representation [81]. Instead of calculating a single scalar value over an arbitrary rectangular range [81], our method uses a few atomic operations (e.g., add, bitwise, etc.) to calculate a set of binary patterns over an 8×8 fixed range.

Algorithm 2 Get BING features for W×H positions
Comments: see Fig. 2 for an explanation of variables
Input: binary normed gradient map b_{W×H}
Output: BING feature matrix b_{W×H}
Initialize: b_{W×H} = 0, r_{W×H} = 0
for each position (x, y) in scan-line order do
    r_{x,y} = (r_{x−1,y} ≪ 1) | b_{x,y}
    b_{x,y} = (b_{x,y−1} ≪ 8) | r_{x,y}
end for

Fig. 2 Variables: a BING feature b_{x,y}, its last row r_{x,y}, and last element b_{x,y}. Notice that the subscripts i, x, y, l, k, introduced in Eq. (2) and Eq. (5), are locations over the whole vector rather than the indices of vector elements. We can use a single atomic variable (int64 and byte) to represent a BING feature and its last row, respectively, enabling efficient feature computation.
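A direct C++ transcription of Algorithm 2 (our sketch, with out-of-range neighbours read as zero, matching the initialization) shows how one bit-plane of BING features for all positions is produced with just a shift and an OR per pixel:

```cpp
#include <cstdint>
#include <vector>

// Algorithm 2: compute one bit-plane of BING features for all W x H
// positions in scan-line order. bit[y*W+x] holds the current binary
// normed gradient bit; each int64 accumulates the last 8 rows of 8 bits.
void getBingFeatures(const std::vector<uint8_t>& bit,  // W*H binary map (0/1)
                     int W, int H,
                     std::vector<uint64_t>& bing) {    // W*H BING features
    std::vector<uint8_t> r(W * H, 0);                  // last-row bytes
    bing.assign(W * H, 0);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            const int id = y * W + x;
            // r_{x,y} = (r_{x-1,y} << 1) | b_{x,y}: the byte shift
            // automatically drops the column leaving the 8-wide window.
            const uint8_t left = (x > 0) ? r[id - 1] : 0;
            r[id] = (uint8_t)((left << 1) | bit[id]);
            // b_{x,y} = (b_{x,y-1} << 8) | r_{x,y}: the int64 shift
            // automatically drops the row leaving the 8-high window.
            const uint64_t up = (y > 0) ? bing[id - W] : 0;
            bing[id] = (up << 8) | r[id];
        }
}
```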

The filter score of Eq. (1) for an image window corresponding to BING features b_{k,l} can be efficiently computed using

    s_l ≈ Σ_{j=1}^{N_w} β_j Σ_{k=1}^{N_g} 2^{8−k} (2⟨a_j⁺, b_{k,l}⟩ − |b_{k,l}|)    (7)
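Combining the pieces, a sketch of Eq. (7): the binarized model from Algorithm 1 is tested against each of the N_g bit-planes, and the per-plane responses are weighted by 2^{8−k}. Again an illustrative sketch assuming C++20, not the released code.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cmath>    // std::ldexp
#include <cstdint>
#include <vector>

// Filter score of Eq. (7): combine the Ng BING bit-planes b_{k,l}
// (most significant plane first) with the binarized model
// {aPlus[j], beta[j]} produced by Algorithm 1.
double bingScore(const std::vector<uint64_t>& planes,   // size Ng
                 const std::vector<uint64_t>& aPlus,    // size Nw
                 const std::vector<double>& beta) {
    double s = 0.0;
    for (size_t j = 0; j < aPlus.size(); ++j) {
        double cj = 0.0;
        for (size_t k = 0; k < planes.size(); ++k) {
            const int inner = 2 * std::popcount(aPlus[j] & planes[k])
                            - std::popcount(planes[k]);
            cj += std::ldexp((double)inner, 7 - (int)k);  // weight 2^(8-(k+1))
        }
        s += beta[j] * cj;
    }
    return s;
}
```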

To implement these ideas, we use the 1D kernel [−1, 0, 1] to find image gradients g_x and g_y in the horizontal and vertical directions, calculate normed gradients using min(|g_x| + |g_y|, 255), and save them as byte values. By default, we calculate gradients in RGB color space.
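The gradient computation itself is only a few lines; below is a minimal single-channel sketch (our own, for illustration). Extending it to the RGB default would apply the same computation per channel.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Normed gradients of an (already resized) grayscale image: [-1, 0, 1]
// differences in x and y, magnitude approximated by min(|gx|+|gy|, 255),
// stored as bytes. Border pixels are left at zero for brevity.
std::vector<uint8_t> normedGradients(const std::vector<uint8_t>& img,
                                     int W, int H) {
    std::vector<uint8_t> ng(W * H, 0);
    for (int y = 1; y + 1 < H; ++y)
        for (int x = 1; x + 1 < W; ++x) {
            const int id = y * W + x;
            const int gx = (int)img[id + 1] - (int)img[id - 1];
            const int gy = (int)img[id + W] - (int)img[id - W];
            ng[id] = (uint8_t)std::min(std::abs(gx) + std::abs(gy), 255);
        }
    return ng;
}
```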

      4 Enhancing BING with region cues

BING is not only very efficient, but can also achieve a high object detection rate. However, measured by ABO or MABO, its performance is disappointing. When further applying BING in object detection frameworks which use object proposals as input, like fast R-CNN, the detection rate is also poor. This suggests that BING does not provide good proposal localization.

Two reasons may cause this. On one hand, given an object, BING tries to capture its closed boundary by resizing it to a small fixed size and setting larger weights at the most probable positions. However, as the shapes of objects vary, the closed boundaries of objects will be mapped to different positions in the fixed size windows. The learned model of NG features cannot adequately represent this variability across objects. On the other hand, BING is designed to test only a limited set of quantized window sizes, while the sizes of objects are variable. Thus, to some extent, bounding boxes generated by BING are unable to tightly cover all objects.

In order to improve this unsatisfactory localization, we use multi-thresholding straddling expansion (MTSE) [38], which is an effective method to refine object proposals using segments. Given an image and corresponding initial bounding boxes, MTSE first aligns boxes with potential object boundaries preserved by superpixels, and then performs multi-thresholding expansion with respect to the superpixels straddling each box. In this way, each bounding box tightly covers a set of internal superpixels, significantly improving the localization quality of proposals. However, the MTSE algorithm is too slow; the bottleneck is its segmentation step [41]. Thus, we use a new fast image segmentation method [42] to replace the segmentation method in MTSE.

Recently, SLIC [40] has become a popular superpixel generation method because of its efficiency; gSLICr, the GPU version of SLIC [39], can achieve a speed of 250 fps. SLIC aims to generate small superpixels and is not good at producing large image segments. In the MTSE algorithm, large image segments are needed to ensure accuracy, so it is not straightforward to use SLIC within MTSE. However, the high efficiency of SLIC makes it a good starting point for developing new segmentation methods. We first use gSLICr to segment an image into many small superpixels. Then, we view each superpixel as a node whose color is denoted by the average color of all its pixels, and the distance between two adjacent nodes is computed as the Euclidean distance of their color values. Finally, we feed these nodes into a graph-based segmentation method to produce the final image segmentation [42].
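Our reading of this merging step, as a hedged C++ sketch: superpixels become graph nodes, adjacent nodes get edges weighted by the Euclidean distance of their average colors, and a Felzenszwalb–Huttenlocher style criterion with scale parameter k merges components. The structure names (Edge, DisjointSet) and details are our assumptions, not the code of Refs. [39–42]; in particular, the minimum-segment-size constraint mentioned in Section 5.2.2 is omitted here.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Edges carry precomputed Euclidean color distances between the
// average colors of two adjacent superpixels a and b.
struct Edge { float w; int a, b; };

struct DisjointSet {
    std::vector<int> parent, size;
    std::vector<float> thresh;   // merge threshold: internal diff + k/size
    DisjointSet(int n, float k) : parent(n), size(n, 1), thresh(n, k) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];
        return x;
    }
};

// Felzenszwalb-style merging over superpixel nodes: process edges in
// increasing weight order and merge when the edge is no heavier than
// what both components currently tolerate.
void mergeSuperpixels(std::vector<Edge>& edges, float k, DisjointSet& ds) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& l, const Edge& r) { return l.w < r.w; });
    for (const Edge& e : edges) {
        const int a = ds.find(e.a), b = ds.find(e.b);
        if (a != b && e.w <= ds.thresh[a] && e.w <= ds.thresh[b]) {
            ds.parent[b] = a;
            ds.size[a] += ds.size[b];
            ds.thresh[a] = e.w + k / ds.size[a];
        }
    }
}
```

After merging, ds.find(i) gives the final segment id of superpixel i, and MTSE's alignment and expansion then run against these segments.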

We employ the full MTSE pipeline, and modify it to use our new segmentation algorithm, reducing the computation time from 0.15 s down to 0.0014 s per image. Incorporating this improved version of MTSE as a postprocessing enhancement step for BING gives our new proposal system, which we call BING-E.

      5 Evaluation

      5.1 Background

We have extensively evaluated our method on the challenging PASCAL VOC2007 [33] and Microsoft COCO [43] datasets. PASCAL VOC2007 contains 20 object categories, and consists of training, validation, and test sets, with 2501, 2510, and 4952 images respectively, having corresponding bounding box annotations. We use the training set to train our BING model and test on the test set. Microsoft COCO consists of 82,783 images for training and 40,504 images for validation, with about 1 million annotated instances in 80 categories. COCO is more challenging because of its large size and complex image contents.

We compared our method to various competitive methods: EdgeBoxes [34]①, CSVM [44]②, MCG [36]③, RPN [30]④, Endres [21], Objectness [2], GOP [48], LPO [49], Rahtu [45], RandomPrim [46], Rantalankila [47], and SelectiveSearch [20], using publicly available code [82] downloaded from https://github.com/Cloud-CV/object-proposals. All parameters for these methods were set to default values, except for Ref. [48], for which we employed (180, 9) as suggested on the author's homepage. To make the comparison fair, all methods except for the deep learning based RPN [30] were tested on the same device with an Intel i7-6700K CPU and NVIDIA GeForce GTX 970 GPU, with data parallelization enabled. For RPN, we utilized an NVIDIA GeForce GTX TITAN X GPU for computation.

① https://github.com/pdollar/edges
② https://zimingzhang.wordpress.com/
③ http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/mcg/
④ https://github.com/rbgirshick/py-faster-rcnn

Since objectness is often used as a preprocessing step to reduce the number of windows considered in subsequent processing, too many proposals are unhelpful. Therefore, we only used the top 1000 proposals for comparison.

In order to evaluate the generalization ability of each method, we tested them on the COCO validation dataset using the same parameters as for VOC2007, without retraining. Since at least 60 categories in COCO differ from those in VOC2007, COCO is a good test of the generalization ability of the methods.

5.2 Experimental setup

      5.2.1 Discussion of BING

As shown in Table 1, by using the binary approximation to the learned linear filter (Section 3.4) and BING features, computing the response score for each image window needs only a fixed small number of atomic operations. It is easy to see that the number of positions at each quantized scale and aspect ratio is O(N), where N is the number of pixels in the image. Thus, computing response scores at all scales and aspect ratios also has computational complexity O(N). Furthermore, extracting the BING feature and computing the response score at each potential position (i.e., an image window) can be calculated with information given by its two neighbors, to the left and above. This means that the space complexity is also O(N).

For training, we flip the images and the corresponding annotations. The positive samples are boxes that have an IoU overlap with a ground truth box of at least 0.5, while the maximum IoU overlap with the ground truth for the negative sample boxes is less than 0.5.
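For reference, the IoU test used for this labeling, as a small self-contained sketch (the box layout is our assumption):

```cpp
#include <algorithm>

// Box as (left, top, right, bottom) in pixels. IoU labels training
// samples: positives reach IoU >= 0.5 against some ground-truth box,
// negatives stay below 0.5 against every ground truth.
struct Box { int l, t, r, b; };

double iou(const Box& p, const Box& q) {
    const int iw = std::min(p.r, q.r) - std::max(p.l, q.l);
    const int ih = std::min(p.b, q.b) - std::max(p.t, q.t);
    if (iw <= 0 || ih <= 0) return 0.0;       // no overlap
    const double inter = (double)iw * ih;
    const double areaP = (double)(p.r - p.l) * (p.b - p.t);
    const double areaQ = (double)(q.r - q.l) * (q.b - q.t);
    return inter / (areaP + areaQ - inter);   // intersection over union
}
```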

Table 1 Average number of atomic operations for computing objectness of each image window at different stages: calculating normed gradients, extracting BING features, and getting the objectness score

Some window sizes whose aspect ratios are too large are ignored, as there are too few training samples (fewer than 50) in VOC2007 for each of them. Our training on 2501 VOC2007 images takes only 20 seconds (excluding XML loading time).

We further illustrate in Table 2 how different approximation levels influence the result quality. From this comparison, we decided to use N_w = 2, N_g = 4 in all further experiments.

      5.2.2 Implementation of BING-E

In BING-E, removing some small BING windows, with W_o < 30 or H_o < 30, hardly degrades the proposal quality of BING-E while halving the runtime spent on BING processing. When using gSLICr [39] to segment images into superpixels, we set the expected size of superpixels to 4×4. In the graph-based segmentation system [41, 42], we use the scale parameter k = 120, and the minimum number of superpixels in each produced segment is set to 6. We utilize the default multi-thresholds of MTSE: {0.1, 0.2, 0.3, 0.4, 0.5}. After refinement, non-maximal suppression (NMS) is performed to obtain the final boxes, with the IoU threshold of NMS set to 0.8. All experiments used these settings.

      5.3 PASCAL VOC2007

Table 2 Average result quality (DR using 1000 proposals) of BING at different approximation levels, measured by N_w and N_g in Section 3.4. N/A represents unbinarized

5.3.1 Results

As demonstrated by Refs. [2, 20], a small set of coarse locations with high detection recall (DR) is sufficient for effective object detection, and it allows expensive features and complementary cues to be involved in subsequent detection to achieve better quality and higher efficiency than traditional methods. Thus, we first compare our method with some competitors using detection recall metrics. Figure 3(a) shows detection recall when varying the IoU overlap threshold, using 1000 proposals. EdgeBoxes and MCG outperform many other methods in all cases. RPN achieves very high performance when the IoU threshold is less than 0.7, but then drops rapidly. Note that RPN is the only deep learning based method amongst these competitors. BING's performance is not competitive when the IoU threshold increases, but BING-E provides close to the best performance. It should be emphasized that both BING and BING-E are more than 100 times faster than most popular alternatives [20, 21, 34, 36] (see details in Table 3). The performance of BING and CSVM [44] almost coincides in all three subfigures, but BING is 100 times faster than CSVM. The significant improvement from BING to BING-E illustrates that BING is a strong basis that can be extended and improved in various ways. Since BING is able to run at about 300 fps, its variants can still be very fast. For example, BING-E can generate competitive candidates at over 200 fps, far faster than most other detection algorithms.

Figures 3(b)–3(d) show detection recall and MABO versus the number of proposals (#WIN) respectively. When the IoU threshold is 0.5, both BING and BING-E perform very well; when the number of candidates is sufficient, BING and BING-E outperform all other methods. In Fig. 3(c), the recall curve of BING drops significantly, as it does in the MABO evaluation. This may be because the proposal localization quality of BING is poor. However, the performance of BING-E is consistently close to the best performance, indicating that it overcomes BING's localization problem.

Table 3 Detection recall (%) using different IoU thresholds and #WIN on the VOC2007 test set


Fig. 3 Testing results on the PASCAL VOC2007 test set: (a) object detection recall versus IoU overlap threshold; (b, c) recall versus the number of candidates at IoU thresholds 0.5 and 0.7 respectively; (d) MABO versus the number of candidates using at most 1000 proposals.

We show a numerical comparison of recall vs. #WIN in Table 3. BING-E always performs better than most competitors. BING and BING-E are clearly faster than all of the other methods. Although EdgeBoxes, MCG, and SelectiveSearch perform very well, they are too slow for many applications. In contrast, BING-E is more attractive. It is also interesting to find that the detection recall of BING-E increases by 46.1% over BING using 1000 proposals with IoU threshold 0.7, which suggests that the accuracy of BING has much room for improvement by applying postprocessing. Table 4 compares ABO & MABO scores with the competitors. MCG always outperforms the others by a large margin, but BING-E is competitive with all other methods.

Since proposal generation is usually a preprocessing step in vision tasks, we fed candidate boxes produced by the objectness methods into the fast R-CNN [4] object detection framework to test the effectiveness of proposals in practical applications. The CNN model of fast R-CNN was retrained using boxes from the respective methods. Table 5 shows the evaluation results. In terms of MAP (mean average precision), the overall detection rates of all methods are quite similar. RPN performs slightly better, while our BING-E method gives very close to the best performance. Although MCG almost dominates the recall, ABO, and MABO metrics, it does not achieve the best performance on object detection, and is worse than BING-E. In summary, we may say that BING-E provides state-of-the-art generic object proposals at a much higher speed than other methods. Finally, we illustrate sample results of varying complexity provided by our improved BING-E method for VOC2007 test images in Fig. 5, to demonstrate our high-quality proposals.

Table 4 ABO & MABO (%) using at most 1000 proposals per image on the VOC2007 test set

      5.3.2 Discussion

In order to perform further analysis, we divided the ground truths into different sets according to their window sizes, and tested some of the most competitive methods on these sets. Table 6 shows the results. When the ground truth area is small, BING-E performs much worse than the other methods. As the ground truth area increases, the gap between BING-E and other state-of-the-art methods gradually narrows, and BING-E outperforms all of them on the recall metric when the area is larger than 2¹². Figure 4 shows some failing examples produced by BING-E. Note that almost all falsely detected objects are small. Such small objects may have blurred boundaries, making them hard to distinguish from the background.

Table 5 Detection average precision (%) using fast R-CNN on the VOC2007 test set with 1000 proposals

Table 6 Recall/MABO (%) vs. area on the VOC2007 test set with 1000 proposals and IoU threshold 0.5

Fig. 4 Some failure examples of BING-E. Failure means that the overlap between the best detected box (green) and ground truth (red) is less than 0.5. All images are from the VOC2007 test set.

Note that MCG achieves much better performance on small objects, and this may be the main cause of the drop in detection rate when using MCG in the fast R-CNN framework. Fast R-CNN uses the VGG16 [83] model, in which the convolutional feature maps are pooled several times. A feature map will be just 1/2⁴ the size of the original object when it arrives at the last convolutional layer of VGG16, and will be too coarse to classify such small instances. Thus, using MCG proposals to retrain the CNN model may confuse the network because of the detected small object proposals. As a result, MCG does not achieve the best performance in the object detection task although it outperforms others on the recall and MABO metrics.

Fig. 5 True positive object proposals for VOC2007 test images using BING-E.

5.4 Microsoft COCO

In order to test the generalization ability of the various methods, we extensively evaluated them on the COCO validation set using the same parameters as for the VOC2007 dataset, without retraining. As this dataset is so large, we only compared against some of the more efficient methods.

Fig. 6 Testing results on the COCO validation dataset: (a) object detection recall versus IoU overlap threshold; (b, c) recall versus number of candidates at IoU thresholds 0.5 and 0.7 respectively; (d) MABO versus the number of candidates using at most 1000 proposals.

Figure 6(a) shows object detection recall versus IoU overlap threshold using different numbers of proposals. MCG always dominates in performance, but its low speed makes it unsuited to many vision applications. EdgeBoxes performs well when the IoU threshold is small, and LPO performs well for large IoU thresholds. The performance of BING-E is slightly below the state of the art. BING, Rahtu, and Objectness all struggle on the COCO dataset, suggesting that these methods may not be robust in complex scenes. RPN performs very poorly on COCO, which means it is highly dependent on the training data. As noted in Ref. [82], a good object proposal algorithm should be category independent. Although RPN achieves good results on VOC2007, it is not consistent with the goal of designing a category independent object proposal method.

Figures 6(b)–6(d) show recall and MABO when varying the number of proposals. Clearly, RPN suffers a big drop in performance relative to VOC2007. Its recall at IoU 0.5 and its MABO are even worse than those of BING. BING and BING-E are very robust when transferring to different object classes. Table 7 shows a statistical comparison. Although BING and BING-E do not achieve the best performance, they obtain very high computational efficiency with a moderate drop in accuracy. The significant improvement from BING to BING-E suggests that BING would be a good basis for combining with other more accurate bounding box refinement methods in cases where the increased computational load is acceptable.

Table 7 Detection recall (%) using different IoU thresholds and #WIN on the COCO validation set

      6 Conclusions and future work

      6.1 Conclusions

We have presented a surprisingly simple, fast, and high-quality objectness measure using 8×8 binarized normed gradients (BING) features. Computing the objectness of each image window at any scale and aspect ratio needs only a few atomic (add, bitwise, etc.) operations. To improve the localization quality of BING, we further proposed BING-E, which incorporates an efficient image segmentation strategy. Evaluation results using the most widely used benchmarks (VOC2007 and COCO) and evaluation metrics show that BING-E can generate state-of-the-art generic object proposals at a significantly higher speed than other methods. Our evaluation demonstrates that BING is a good basis for object proposal generation.

      6.2 Limitations

BING and BING-E predict a small set of object bounding boxes. Thus, they share similar limitations with all other bounding box based objectness measures [2, 44] and classic sliding window based object detection methods [84, 85]. For some object categories (snakes, wires, etc.), a bounding box might not localize object instances as well as a segmentation region [21, 47, 75].

      6.3 Future work

The high quality and efficiency of our method make it suitable for many realtime vision applications and for uses based on large scale image collections (e.g., ImageNet [86]). In particular, the binary operations and memory efficiency make our BING method suitable for low-power devices [79, 80]. Our speed-up strategy of reducing the number of tested windows is complementary to other speed-up techniques which try to reduce the subsequent processing time required for each location. The efficiency of our method addresses the computational bottleneck of proposal based vision tasks such as object detection [4, 87], enabling real-time high-quality object detection.

We have demonstrated how to generate a small set (e.g., 1000) of proposals to cover nearly all potential object regions, using very simple BING features and a postprocessing step. It would be interesting to introduce additional cues to further reduce the number of proposals while maintaining a high detection rate [88, 89], and to explore more applications [23–27, 29, 90] using BING and BING-E. To encourage future work, the source code will be kept up-to-date at http://mmcheng.net/bing.

      Acknowledgements

This research was supported by the National Natural Science Foundation of China (Nos. 61572264, 61620106008).
