HttpProtocol (both okhttp and apache) race condition while having different proxies in different threads #1247

@chhsiao90

What kind of issue is this?

  • Question. This issue tracker is not the best place for questions. If you want to ask how to do
    something, or to understand why something isn't working the way you expect it to, use StackOverflow
    instead with the label 'stormcrawler': https://stackoverflow.com/questions/tagged/stormcrawler

  • Bug report. If you’ve found a bug, please include a test if you can, it makes it a lot easier to fix things. Use the label 'bug' on the issue.

  • Feature request. Please use the label 'wish' on the issue.

Steps to reproduce

To reproduce, run the HttpProtocol main function against many URLs with MultiProxyManager configured.

crawler.conf:

config:
  http.agent.name: test
  http.proxy.manager: org.apache.stormcrawler.proxy.MultiProxyManager
  http.proxy.file: proxies
  http.robots.file.skip: true

The proxies file:

http://first:password@proxy1:8888
http://second:password@proxy2:8888

Root cause

HttpProtocol (both the okhttp and apache implementations) is not thread-safe:

  • The same instance, created by ProxyFactory, may be used by different bolts (different workers) at the same time.
  • The shared request/client builder is mutated by different bolts/threads at the same time.
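
One possible way to avoid sharing a mutable builder is to keep an untouched base configuration and derive a fresh builder for every request. The following is a minimal sketch under that assumption, not the StormCrawler implementation: `Builder`, `fetchVia` and `runConcurrently` are hypothetical names, and the "client" is reduced to a string so the result is easy to inspect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerCallBuilderSketch {
    // Hypothetical stand-in for a client/request builder; not a real library class.
    static final class Builder {
        private String proxy;
        private String auth;
        Builder() {}
        // Copy constructor: each request starts from its own copy of the base settings.
        Builder(Builder other) { this.proxy = other.proxy; this.auth = other.auth; }
        Builder setProxy(String p) { this.proxy = p; return this; }
        Builder setAuth(String a) { this.auth = a; return this; }
        String build() { return proxy + " + " + auth; }
    }

    // Base configuration, never mutated after construction.
    private final Builder base = new Builder();

    // The builder is a local variable, so no other thread can interleave its setters.
    String fetchVia(String proxy, String auth) {
        return new Builder(base).setProxy(proxy).setAuth(auth).build();
    }

    // Calls fetchVia from several threads on one shared protocol instance.
    static List<String> runConcurrently(int requests) {
        PerCallBuilderSketch protocol = new PerCallBuilderSketch();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                String id = (i % 2 == 0) ? "first" : "second";
                futures.add(pool.submit(() -> protocol.fetchVia(id + "Proxy", id + "Auth")));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Every line pairs a proxy with its own credentials.
        runConcurrently(10).forEach(System.out::println);
    }
}
```

Because the builder never escapes the call that created it, each result pairs a proxy with its own credentials regardless of thread scheduling. Copying a builder per request has a cost in real HTTP clients, so whether this is the right trade-off for HttpProtocol is left open here.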

Example 1 (wrong proxy auth)

  • (Thread 2) builder.setProxy(secondProxy)
  • (Thread 1) builder.setProxy(firstProxy)
  • (Thread 1) builder.setAuth(firstAuth)
  • (Thread 2) builder.setAuth(secondAuth)
  • (Thread 1) builder.build()
  • Result: Thread 1's client is built with firstProxy + secondAuth

Example 2 (wrong proxy used)

  • (Thread 1) builder.setProxy(firstProxy)
  • (Thread 1) builder.setAuth(firstAuth)
  • (Thread 2) builder.setProxy(secondProxy)
  • (Thread 2) builder.setAuth(secondAuth)
  • (Thread 1) builder.build()
  • Result: Thread 1's client is built with secondProxy, so both requests use the second proxy
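
The two interleavings above can be replayed deterministically on a single thread, which makes the outcome easy to check without depending on thread scheduling. This is only an illustrative sketch: `Builder` is a hypothetical stand-in for the shared request/client builder, not the okhttp or apache class, and the comments mark which thread would have issued each call.

```java
public class SharedBuilderRace {
    // Hypothetical mutable builder shared by all callers; not a real library class.
    static final class Builder {
        private String proxy;
        private String auth;
        Builder setProxy(String p) { this.proxy = p; return this; }
        Builder setAuth(String a) { this.auth = a; return this; }
        String build() { return proxy + " + " + auth; }
    }

    // Example 1: the setters of the two threads interleave.
    static String example1() {
        Builder shared = new Builder();
        shared.setProxy("secondProxy"); // Thread 2
        shared.setProxy("firstProxy");  // Thread 1
        shared.setAuth("firstAuth");    // Thread 1
        shared.setAuth("secondAuth");   // Thread 2
        return shared.build();          // Thread 1
    }

    // Example 2: Thread 2 overwrites everything before Thread 1 builds.
    static String example2() {
        Builder shared = new Builder();
        shared.setProxy("firstProxy");  // Thread 1
        shared.setAuth("firstAuth");    // Thread 1
        shared.setProxy("secondProxy"); // Thread 2
        shared.setAuth("secondAuth");   // Thread 2
        return shared.build();          // Thread 1
    }

    public static void main(String[] args) {
        System.out.println(example1()); // prints "firstProxy + secondAuth"
        System.out.println(example2()); // prints "secondProxy + secondAuth"
    }
}
```

In both cases the value returned to Thread 1 reflects whichever setter ran last on each field, not the values Thread 1 itself supplied.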
