Transcript of: Fast R-CNN
Source: web.cs.ucdavis.edu/~yjlee/teaching/ecs289g-fall2015/charlie1.pdf (42 pages)

Page 1: Fast R-CNN

Fast R-CNN

Author: Ross Girshick

Speaker: Charlie Liu

Date: Oct 13th

Girshick, R. (2015). Fast R-CNN. arXiv preprint arXiv:1504.08083. ECS 289G 001 Paper Presentation, Prof. Lee

Page 2: Fast R-CNN

Result¹

[Chart: Accuracy, mAP (60%–67%); bars for R-CNN vs. Fast R-CNN]

Girshick. Fast R-CNN; Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation

1. Dataset: PASCAL VOC 2012

Page 3: Fast R-CNN

Result¹

[Chart: Accuracy, mAP (60%–67%); bars for R-CNN vs. Fast R-CNN]
[Chart: Training Speed-up (0–10x); bars for R-CNN vs. Fast R-CNN]

Girshick. Fast R-CNN; Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation

1. Dataset: PASCAL VOC 2012

Page 4: Fast R-CNN

Result¹

[Chart: Accuracy, mAP (60%–67%); bars for R-CNN vs. Fast R-CNN]
[Chart: Training Speed-up (0–10x); bars for R-CNN vs. Fast R-CNN]
[Chart: Testing Speed-up (0–250x); bars for R-CNN vs. Fast R-CNN]

Girshick. Fast R-CNN; Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation

1. Dataset: PASCAL VOC 2012

Page 5: Fast R-CNN

Object Detection after R-CNN

R-CNN
• 66.0% mAP¹
• 1x test speed

SPPnet
• 63.1% mAP
• 24x faster than R-CNN

Fast R-CNN
• 66.6% mAP
• 10x faster than SPPnet

Girshick. Fast R-CNN; He et al. Spatial pyramid pooling in deep convolutional networks for visual recognition; Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation

1. mAP based on PASCAL VOC 2007, results from Girshick

Page 6: Fast R-CNN

R-CNN

Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation

Page 7: Fast R-CNN

R-CNN Limitations

• Too slow
  • 13 s/image on a GPU
  • 53 s/image on a CPU
  • 7x slower with VGG-Net

Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation

Page 8: Fast R-CNN

R-CNN Limitations

• Too slow

• Proposals need to be warped to a fixed size
  • Potential loss of accuracy

Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation

Page 9: Fast R-CNN

R-CNN Limitations

• Cropping may lose part of the object’s information

• Warping may change the object’s appearance

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

[Diagram: Image → Crop/Warp → Conv layers → FC layers → Output]

Page 10: Fast R-CNN

SPPnet: Motivation

• SPP: Spatial Pyramid Pooling

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

[Diagram: Image → Conv layers → SPP layer → FC layers → Output]

Page 11: Fast R-CNN

SPPnet: Motivation

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

[Diagram: Image → Conv layers → SPP layer → FC layers → Output]

• SPP: Spatial Pyramid Pooling

• Arbitrary size input

• Fixed length output
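To make this concrete, here is a minimal sketch of spatial pyramid pooling (my own PyTorch-style code, not He et al.'s implementation; the function name and pyramid levels are illustrative). Pooling over a fixed pyramid of grids yields the same output length for any input size:

```python
# Minimal SPP sketch: max-pool the feature map over a pyramid of fixed
# grids (4x4, 2x2, 1x1) and concatenate, so the output length depends
# only on the channel count and the pyramid, never on H or W.
import torch
import torch.nn.functional as F

def spp(feature_map, levels=(4, 2, 1)):
    """feature_map: (N, C, H, W) with arbitrary H, W.
    Returns (N, C * sum(l * l for l in levels))."""
    n = feature_map.size(0)
    pooled = [F.adaptive_max_pool2d(feature_map, l).view(n, -1) for l in levels]
    return torch.cat(pooled, dim=1)

# Two different input sizes, one output length:
for h, w in [(13, 13), (24, 37)]:
    x = torch.randn(1, 256, h, w)
    print(spp(x).shape)  # torch.Size([1, 5376]) both times: 256 * (16 + 4 + 1)
```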

Page 12: Fast R-CNN

SPPnet: Motivation

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

[Diagram: Image → Conv layers → SPP layer → FC layers → Output]

• SPP: Spatial Pyramid Pooling

Page 13: Fast R-CNN

SPPnet: Motivation

R-CNN: 2000 nets on 1 image
SPPnet: 1 net on 1 image

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

[Diagram: Image → Conv layers → SPP layer → FC layers → Output]

Page 14: Fast R-CNN

SPPnet: Motivation

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

[Diagram: Image → Conv layers → SPP layer → FC layers → Output]

• SPP Detection

Page 15: Fast R-CNN

SPPnet: Result

VOC 2007¹

                 R-CNN    SPPnet 1-scale    SPPnet 5-scale
mAP              58.5     58.0              59.2
GPU time / img   9s       0.14s             0.38s
Speed-up²        1x       64x               24x

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

1. mAP based on PASCAL VOC 2007, results from He et al.
2. Speed-up refers to testing speed.

Page 16: Fast R-CNN

SPPnet: Limitations

• High memory consumption

[Diagram: conv layers compute Conv5 feature maps, all of which must be stored]

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition; Liliang Zhang. Detection: From R-CNN to Fast R-CNN

Page 17: Fast R-CNN

SPPnet: Limitations

• High memory consumption

• Training is inefficient
  • Inefficient back-propagation through the SPP layer

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Page 18: Fast R-CNN

SPPnet: Limitations

• High memory consumption

• Training is inefficient
  • Inefficient back-propagation through the SPP layer

• Multi-stage pipeline

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

[Diagram: Conv layers → FC layers → Classifier, Regressor]

Page 19: Fast R-CNN

Fast R-CNN: What’s New

Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation; Girshick. Fast R-CNN

Page 20: Fast R-CNN

Fast R-CNN: What’s New

Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation; Girshick. Fast R-CNN

Page 21: Fast R-CNN

Fast R-CNN: What’s New

Girshick. Fast R-CNN

• RoI Pooling Layer
  • ≈ a single-scale SPP layer
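Since the RoI pooling layer is essentially a one-level SPP applied per proposal, a minimal sketch may help (hypothetical code, not Girshick's Caffe implementation; `roi_pool`, `spatial_scale`, and the box format are illustrative assumptions):

```python
# RoI pooling sketch: crop each proposal out of the shared conv feature
# map and max-pool it into a fixed grid, so every RoI yields the same
# output shape for the FC layers.
import torch
import torch.nn.functional as F

def roi_pool(feature_map, rois, output_size=7, spatial_scale=1.0 / 16):
    """feature_map: (C, H, W) from the shared conv layers.
    rois: list of (x1, y1, x2, y2) boxes in image coordinates.
    spatial_scale maps image coords to feature-map coords (e.g. stride 16)."""
    out = []
    for x1, y1, x2, y2 in rois:
        # Project the RoI onto the feature map, keeping at least one cell.
        fx1, fy1 = int(x1 * spatial_scale), int(y1 * spatial_scale)
        fx2 = max(fx1 + 1, int(x2 * spatial_scale))
        fy2 = max(fy1 + 1, int(y2 * spatial_scale))
        region = feature_map[:, fy1:fy2, fx1:fx2]
        out.append(F.adaptive_max_pool2d(region, output_size))
    return torch.stack(out)  # (num_rois, C, output_size, output_size)

fmap = torch.randn(256, 38, 50)                 # e.g. a 600x800 image at stride 16
rois = [(0, 0, 320, 240), (100, 80, 500, 400)]
print(roi_pool(fmap, rois).shape)               # torch.Size([2, 256, 7, 7])
```

The conv feature map is computed once per image; only the cheap pooling and the FC layers run per RoI.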

Page 22: Fast R-CNN

Fast R-CNN: What’s New

Girshick. Fast R-CNN

• Sibling Classifier & Regressor Layers

Page 23: Fast R-CNN

Fast R-CNN: What’s New

Girshick. Fast R-CNN

Page 24: Fast R-CNN

Fast R-CNN: What’s New

Girshick. Fast R-CNN; Liliang Zhang. Detection: From R-CNN to Fast R-CNN

• Joint training: a one-stage framework
  • Joins the feature extractor, classifier, and regressor together

Page 25: Fast R-CNN

Fast R-CNN: Advantages

• One fine-tuning stage
  • Optimizes the classifier and bbox regressor together

• Loss function:
  • L = L_classifier + λ·L_regressor

Girshick. Fast R-CNN
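A minimal sketch of this multi-task loss, assuming cross-entropy for the classifier and smooth L1 for the bbox regressor as in the paper (the function and variable names are my own; the paper predicts 4 offsets per class, while this sketch assumes the labeled class's 4 offsets were pre-selected):

```python
# L = L_classifier + lambda * L_regressor, with the regression term
# applied only to foreground (non-background) RoIs.
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_pred, labels, bbox_targets, lam=1.0):
    """cls_scores: (R, K+1) logits over K classes + background.
    bbox_pred: (R, 4) offsets for each RoI's labeled class.
    labels: (R,) with 0 = background; bbox_targets: (R, 4)."""
    l_cls = F.cross_entropy(cls_scores, labels)
    fg = labels > 0  # background RoIs contribute no localization loss
    l_loc = F.smooth_l1_loss(bbox_pred[fg], bbox_targets[fg]) if fg.any() else 0.0
    return l_cls + lam * l_loc

# 128 RoIs, 20 classes + background:
scores, boxes = torch.randn(128, 21), torch.randn(128, 4)
labels, targets = torch.randint(0, 21, (128,)), torch.randn(128, 4)
print(fast_rcnn_loss(scores, boxes, labels, targets))
```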

Page 26: Fast R-CNN

Fast R-CNN: Advantages

• One fine-tuning stage

• Fast training
  • Uses mini-batch stochastic gradient descent
  • E.g. 2 images * 64 RoIs each
  • 64x faster than 128 images * 1 RoI each (the strategy of R-CNN and SPPnet); see the sketch below

Girshick. Fast R-CNN
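The 64x figure follows from sharing conv computation within each image; a back-of-envelope illustration (my own arithmetic, not code from the paper):

```python
# With 2 images x 64 RoIs, all RoIs in an image share one conv
# forward/backward pass, so the expensive conv work runs 2 times per
# mini-batch instead of 128 times as in the 128-images x 1-RoI strategy.
images_per_batch, rois_per_image = 2, 64
rois_per_batch = images_per_batch * rois_per_image   # 128 RoIs either way
conv_passes_shared = images_per_batch                # 2: one per image
conv_passes_independent = rois_per_batch             # 128: one per RoI
print(conv_passes_independent / conv_passes_shared)  # 64.0
```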

Page 27: Fast R-CNN

Fast R-CNN: Advantages

• One fine-tuning stage

• Fast training

• Efficient back-propagation

Girshick. Fast R-CNN; Liliang Zhang. Detection: From R-CNN to Fast R-CNN

Page 28: Fast R-CNN

Fast R-CNN: Advantages

• One fine-tuning stage

• Fast training

• Efficient back-propagation

• Scale invariance
  • In practice, a single scale is good enough
  • Single scale: 10x faster than SPPnet

Girshick. Fast R-CNN; Liliang Zhang. Detection: From R-CNN to Fast R-CNN

[Diagram: conv layers compute Conv5 feature maps, one per scale]

Page 29: Fast R-CNN

Fast R-CNN: Advantages

• One fine-tuning stage

• Fast training

• Efficient back-propagation

• Scale invariance

• Fast detection
  • Truncated SVD¹: W ≈ UΣ_t V^T
  • Splits a single FC layer into two FC layers
  • Reduces parameters² from uv to t(u + v)

Girshick. Fast R-CNN; Liliang Zhang. Detection: From R-CNN to Fast R-CNN

1. U: size (u × t); Σ_t: size (t × t); V: size (v × t)
2. In practice, t << min(u, v)

Page 30: Fast R-CNN

Fast R-CNN: Advantages

• One fine-tuning stage

• Fast training

• Efficient back-propagation

• Scale invariance

• Fast detection
  • Truncated SVD¹: W ≈ UΣ_t V^T
  • Splits a single FC layer into two FC layers
  • Reduces parameters² from uv to t(u + v); see the numpy sketch below

Girshick. Fast R-CNN; Liliang Zhang. Detection: From R-CNN to Fast R-CNN

1. U: size (u × t); Σ_t: size (t × t); V: size (v × t)
2. In practice, t << min(u, v)
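A minimal numpy sketch of the compression (the layer shape u × v and rank t are illustrative assumptions, loosely fc6-like; this is not the paper's code):

```python
# Truncated SVD: W (u x v) ~ U_t Sigma_t V_t^T, realized as two smaller
# FC layers with t*(u + v) parameters instead of u*v.
import numpy as np

u, v, t = 4096, 9216, 256             # assumed sizes and rank
W = np.random.randn(u, v)             # stand-in for a trained FC weight matrix

U, s, Vt = np.linalg.svd(W, full_matrices=False)
first_fc = Vt[:t, :]                  # (t, v): computes V_t^T x
second_fc = U[:, :t] * s[:t]          # (u, t): computes U_t Sigma_t (Sigma folded in)

x = np.random.randn(v)
y_exact = W @ x
y_approx = second_fc @ (first_fc @ x)     # two cheap matmuls replace one big one
print(u * v, t * (u + v))                 # 37748736 vs 3407872 (~11x fewer params)
print(np.linalg.norm(y_exact - y_approx) / np.linalg.norm(y_exact))
```

The two factors act as two ordinary FC layers, which is why the compressed network needs no architectural change at test time.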

Page 31: Fast R-CNN

Fast R-CNN: Result¹

[Chart: Training Speed-up (0–10x); bars for R-CNN, SPPnet, Fast R-CNN]

                   R-CNN    SPPnet    Fast R-CNN
Train time (h)     84       25        9.5
Train speedup      1x       3.4x      8.8x
Test rate (s/im)   47.0     2.3       0.22
Test speedup       1x       20x       213x
VOC07 mAP          66.0%    63.1%     66.6%

Results using VGG16

Girshick. Fast R-CNN
1. Results from Girshick

Page 32: Fast R-CNN

Fast R-CNN: Result¹

[Chart: Testing Speed-up (0–250x); bars for R-CNN, SPPnet, Fast R-CNN]

                   R-CNN    SPPnet    Fast R-CNN
Train time (h)     84       25        9.5
Train speedup      1x       3.4x      8.8x
Test rate (s/im)   47.0     2.3       0.22
Test speedup       1x       20x       213x
VOC07 mAP          66.0%    63.1%     66.6%

Results using VGG16

Girshick. Fast R-CNN
1. Results from Girshick

Page 33: Fast R-CNN

Fast R-CNN: Result¹

[Chart: Accuracy, mAP (61%–67%); bars for R-CNN, SPPnet, Fast R-CNN]

                   R-CNN    SPPnet    Fast R-CNN
Train time (h)     84       25        9.5
Train speedup      1x       3.4x      8.8x
Test rate (s/im)   47.0     2.3       0.22
Test speedup       1x       20x       213x
VOC07 mAP          66.0%    63.1%     66.6%

Results using VGG16

Girshick. Fast R-CNN
1. Results from Girshick

Page 34: Fast R-CNN

Fast R-CNN: Discussions

• Which layers to fine-tune?
  • Fine-tuning conv1: overruns GPU memory.
  • Fine-tuning conv2: mAP +0.3%, 1.3x slower.
  • Conclusion: fine-tune conv3 and up.

Girshick. Fast R-CNN

Page 35: Fast R-CNN

Fast R-CNN: Discussions

• Which layers to fine-tune? conv3 and up.

• Does multi-task training help?
  • Multi-task training has the potential to improve results
  • Tasks influence each other through a shared representation
  • mAP +0.8% to 1.1%

Girshick. Fast R-CNN

Page 36: Fast R-CNN

Fast R-CNN: Discussions

• Which layers to fine-tune? conv3 and up.

• Does multi-task training help? Yes, mAP +0.8% to 1.1%.

• Scale invariance: single scale or multi-scale?
  • Single scale: faster
  • Multi-scale: more accurate
  • Best trade-off: single scale

Girshick. Fast R-CNN

Page 37: Fast R-CNN

Fast R-CNN: Discussions

• Which layers to fine-tune? conv3 and up.

• Does multi-task training help? Yes, mAP +0.8% to 1.1%.

• Scale invariance: single scale or multi scale? Single.

• Do we need more training data?
  • VOC07 mAP: 66.9% -> 70.0% when augmented with the VOC12 dataset.

Girshick. Fast R-CNN

Page 38: Fast R-CNN

Fast R-CNN: Discussions

• Which layers to fine-tune? conv3 and up.

• Does multi-task training help? Yes, mAP +0.8% to 1.1%.

• Scale invariance: single scale or multi scale? Single.

• Do we need more training data? Yes.

• Do SVMs outperform softmax?
  • SVM: a per-class yes/no decision; Softmax: one vs. all classes.
  • Softmax wins: mAP +0.1% to 0.8%, thanks to competition between classes (see the sketch below).

Girshick. Fast R-CNN
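A small numpy illustration of the "competition between classes" point (my own example, not from the slides): softmax normalizes all class scores jointly, so raising one class's logit necessarily suppresses the others, whereas per-class SVM scores move independently:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax(logits))        # approx [0.63, 0.23, 0.14]
logits[0] += 2.0              # raise only class 0's logit...
print(softmax(logits))        # approx [0.93, 0.05, 0.03]: the others drop
```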

Page 39: Fast R-CNN

Fast R-CNN: Discussions

• Which layers to fine-tune? conv3 and up.

• Does multi-task training help? Yes, mAP +0.8% to 1.1%.

• Scale invariance: single scale or multi scale? Single.

• Do we need more training data? Yes.

• Do SVMs outperform softmax? No; softmax wins, mAP +0.1% to 0.8%.

• Are more proposals always better?

Girshick. Fast R-CNN

Page 40: Fast R-CNN

Fast R-CNN: Discussions

Girshick. Fast R-CNN

Page 41: Fast R-CNN

Fast R-CNN: Discussions

• Which layers to fine-tune? conv3 and up.

• Does multi-task training help? Yes, mAP +0.8% to 1.1%.

• Scale invariance: single scale or multi scale? Single.

• Do we need more training data? Yes.

• Do SVMs outperform softmax? No; softmax wins, mAP +0.1% to 0.8%.

• Are more proposals always better? No.

Girshick. Fast R-CNN

Page 42: Fast R-CNN

Questions? Thanks!

Girshick, R. (2015). Fast R-CNN. arXiv preprint arXiv:1504.08083. ECS 289G 001 Paper Presentation, Oct 13th