scraping

Description

When I scrape without a proxy, both https and http URLs work.
Through a proxy, https URLs work just fine. The problem appears only with http URLs.
In that case I get the error twisted.web.error.SchemeNotSupported: Unsupported scheme: b''

As far as I can see, most people have this issue the other way around.

Steps to Reproduce

Scrape an http link through a proxy.
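
The report does not state the cause. One thing worth checking, offered as an assumption rather than a diagnosis, is whether the proxy address carries an explicit scheme, since an address such as 127.0.0.1:8080 has none. The helper name `ensure_proxy_scheme` and the sample addresses below are hypothetical; the sketch only normalizes a string and does not call Scrapy.

```python
def ensure_proxy_scheme(proxy: str, default: str = "http") -> str:
    """Prefix a proxy address with a scheme if it has none.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    if "://" in proxy:
        # Already looks like scheme://host:port, leave it untouched.
        return proxy
    return f"{default}://{proxy}"


# Sample addresses are made up for the example.
print(ensure_proxy_scheme("127.0.0.1:8080"))
print(ensure_proxy_scheme("https://127.0.0.1:3128"))
```

In a Scrapy spider the normalized value would typically be assigned to request.meta["proxy"], which is where the built-in proxy middleware reads it from.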

**Expected

Tabula API version: 1.2.1.18052200
Filename: 3_2019년_통계부록.pdf
Internal Server Error (500)
Request Method: POST
Request URL: http://127.0.0.1:8080/pdf/8a6599b3be99fda826cc0448d74f0f74dfd3d78d/data

lines must be orthogonal, vertical and horizontal

I got this error while extracting a table.
[pdf file](https://drive.google.com/fil

What is the current behavior?

Crawling a website that uses # (hash fragments) for URL navigation does not crawl the pages reached through those fragments.

URLs containing # are not followed.
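
A plausible explanation, stated as an assumption because the report does not confirm how the crawler treats fragments, is that crawlers often drop the fragment before deciding whether a URL is new. The sketch below shows the effect with the standard library only; the fragment paths are hypothetical and merely modelled on the site named in the report.

```python
from urllib.parse import urldefrag

# Hypothetical URLs: the fragment paths are invented for illustration.
urls = [
    "https://mykita.com/en/",
    "https://mykita.com/en/#/collections",
    "https://mykita.com/en/#/shops",
]


def dedupe_ignoring_fragment(candidates):
    """Return the URLs that remain distinct once fragments are stripped."""
    seen = []
    for url in candidates:
        base, _fragment = urldefrag(url)  # split off everything after '#'
        if base not in seen:
            seen.append(base)
    return seen


distinct = dedupe_ignoring_fragment(urls)
print(distinct)  # ['https://mykita.com/en/']
```

All three addresses collapse to one, so a crawler that deduplicates this way would visit only the first.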

If the current behavior is a bug, please provide the steps to reproduce

Try crawling a website like mykita.com/en/

What is the motivation / use case for changing the behavior?

Though hashes are not meant to chan

The developer of the website I intend to scrape information from is sloppy and has left a lot of broken links.
When I execute an otherwise effective Ferret script on a list of pages, it stops altogether at every 404.
Is there a DOCUMENT_EXISTS or anything that would help the script go on?

There are several things that are inaccurately documented or outdated:

-v2 is used in the examples but does not work
duckduckgo is not supported although it is in the list of supported search engines
To get a list of all search engines, --config is suggested, but that just fails


Here are 1,917 public repositories matching this topic...

scrapy / scrapy

gocolly / colly

psf / requests-html

code4craft / webmagic

tabulapdf / tabula

yujiosaka / headless-chrome-crawler

MontFerret / ferret

emadehsan / thal

NikolaiT / GoogleScraper

symfony / panther

oscarotero / Embed

transitive-bullshit / awesome-puppeteer

geziyor / geziyor

medialab / artoo

holgerd77 / django-dynamic-scraper

meetmangukiya / instagram-scraper

istresearch / scrapy-cluster

iawia002 / Lulu

sananth12 / ImageScraper

speed / newcrawler

scrapy / parsel

MorvanZhou / easy-scraping-tutorial

Lackoftactics / facebook_data_analyzer

phantombuster / nickjs

AlexMathew / scrapple

slotix / dataflowkit

online-judge-tools / oj

dufferzafar / geeksforgeeks.pdf

covidatlas / coronadatascraper

programminghistorian / jekyll
