scraping
Here are 1,917 public repositories matching this topic...
- Updated Jul 3, 2020 - Python
Tabula API version: 1.2.1.18052200
Filename: 3_2019년_통계부록.pdf
Internal Server Error (500)
Request Method: POST
Request URL: http://127.0.0.1:8080/pdf/8a6599b3be99fda826cc0448d74f0f74dfd3d78d/data
lines must be orthogonal, vertical and horizontal
I got this error while extracting a table.
[pdf file](https://drive.google.com/fil
What is the current behavior?
Crawling a website that uses # (hashes) for URL navigation does not crawl the pages behind those hashes.
URLs that differ only in the part after # are not followed.
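A plausible explanation, offered as an assumption and not as a diagnosis of this particular crawler: many crawlers drop the fragment (the part after #) before checking whether a URL was already visited, so every hash route collapses into the page that was crawled first. The sketch below reproduces that behavior with the standard library; the function name and the example.com URLs are hypothetical.

```python
from urllib.parse import urldefrag


def dedupe_by_fragmentless_url(urls):
    """Keep only the first URL for each address once the #fragment is removed.

    This mimics a crawler that treats the fragment as irrelevant to page identity.
    """
    seen = set()
    kept = []
    for url in urls:
        base, _fragment = urldefrag(url)  # splits "page#frag" into ("page", "frag")
        if base not in seen:
            seen.add(base)
            kept.append(url)
    return kept


# Hypothetical links as they might be discovered on a hash-routed site.
links = [
    "https://example.com/en/",
    "https://example.com/en/#/shop",
    "https://example.com/en/#/about",
]
followed = dedupe_by_fragmentless_url(links)
```

With these inputs only the first link survives, because all three share the same address once the fragment is stripped. A crawler that should follow hash routes would need to keep the fragment in its visited-set key and render the page with a browser, since the server never sees the fragment.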
If the current behavior is a bug, please provide the steps to reproduce
Try crawling a website like mykita.com/en/
What is the motivation / use case for changing the behavior?
Though hashes are not meant to chan
The developer of the website I intend to scrape information from is sloppy and has left a lot of broken links.
When I execute an otherwise effective Ferret script on a list of pages, it stops altogether at every 404.
Is there a DOCUMENT_EXISTS or anything that would help the script go on?
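The general pattern being asked for, checking that a page exists before scraping it and skipping it otherwise, can be sketched outside Ferret. The following is Python, not Ferret syntax, and says nothing about which functions Ferret itself provides; `http_status` and `scrape_existing` are hypothetical helper names.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response was received."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # Raised for 4xx/5xx responses; the status code is still available.
        return error.code
    except URLError:
        # DNS failure, refused connection, timeout and similar.
        return None


def scrape_existing(urls, scrape, status_of=http_status):
    """Run scrape(url) only for URLs that answer 200; skip everything else."""
    results = {}
    for url in urls:
        if status_of(url) != 200:
            continue  # broken link: move on to the next page instead of stopping
        results[url] = scrape(url)
    return results
```

Passing the status check in as `status_of` keeps the loop testable without network access. Some servers reject HEAD requests, so a real run may need to fall back to GET.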
Several things are documented inaccurately or are outdated:
- `-v2` is used in the examples but does not work.
- duckduckgo is not supported, although it is in the list of supported search engines.
- To get a list of all search engines, `--config` is suggested, but that just fails.
Description
When I scrape without a proxy, both https and http URLs work.
Using a proxy with https URLs works just fine. My problem is with http URLs.
At that point I get the following error:
twisted.web.error.SchemeNotSupported: Unsupported scheme: b''
As far as I can see, most people have this issue the other way around.
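An empty scheme in this error is often reported when the proxy address itself is given without a scheme (for example `host:port` instead of `http://host:port`). Whether that is the cause here is an assumption, since the report does not show the proxy setting. A minimal sketch of a guard that normalizes the proxy string before it is handed to the downloader; `ensure_proxy_scheme` is a hypothetical helper, not part of any library:

```python
def ensure_proxy_scheme(proxy, default_scheme="http"):
    """Return proxy with an explicit scheme, adding default_scheme if none is present."""
    proxy = proxy.strip()
    if "://" in proxy:
        return proxy  # already has a scheme such as http:// or https://
    return f"{default_scheme}://{proxy}"


# Hypothetical proxy address from the documentation-only 203.0.113.0/24 range.
proxy_url = ensure_proxy_scheme("203.0.113.5:8080")
```

The resulting value is what would then be assigned wherever the proxy is configured, for instance the `proxy` key of a request's `meta` dictionary.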
Steps to Reproduce
**Expected