The Wayback Machine - https://web.archive.org/web/20201013224234/https://github.com/gocolly/colly/pull/408

Cleanup invalid xml characters before unmarshalling #408

Open
wants to merge 1 commit into
base: master
from


jredl-va (Contributor) commented Dec 1, 2019

@asciimoo, for your consideration, here is another PR related to cleaning up XML sitemaps. When unmarshalling XML in Go, I've run into several sites whose sitemaps contain characters that are not legal in XML, which makes the parser fail:

2019/11/30 18:29:56 Error from https://somesite.com/somethin.xml with status code 200: XML syntax error on line 1: illegal character code U+0003

This change cleans up the file before passing it on to the XML parser.
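
For readers who want to see the idea in isolation, here is a minimal sketch of this kind of cleanup. It is not the diff from this PR. The names `isValidXMLChar`, `stripInvalidXMLChars`, `parseURLSet` and the `urlSet` type are hypothetical, and the sketch assumes that "invalid" means any code point outside the XML 1.0 `Char` production and that such characters can simply be dropped.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// isValidXMLChar reports whether r is allowed by the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
func isValidXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// stripInvalidXMLChars drops every rune that XML 1.0 does not allow.
// Hypothetical helper; dropping (rather than replacing) is an assumption.
func stripInvalidXMLChars(s string) string {
	return strings.Map(func(r rune) rune {
		if isValidXMLChar(r) {
			return r
		}
		return -1 // a negative return value tells strings.Map to drop the rune
	}, s)
}

// urlSet is a deliberately tiny, hypothetical model of a sitemap.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// parseURLSet cleans the raw body first, then hands it to encoding/xml.
func parseURLSet(data []byte) (urlSet, error) {
	var set urlSet
	cleaned := stripInvalidXMLChars(string(data))
	err := xml.Unmarshal([]byte(cleaned), &set)
	return set, err
}

func main() {
	// A sitemap fragment containing U+0003, as in the error above.
	raw := []byte("<urlset><url><loc>https://example.com/\x03page</loc></url></urlset>")

	// Without cleanup, encoding/xml is expected to reject the input.
	var direct urlSet
	fmt.Println("direct unmarshal error:", xml.Unmarshal(raw, &direct))

	// With cleanup, the same input should parse.
	set, err := parseURLSet(raw)
	fmt.Println("cleaned unmarshal error:", err)
	for _, u := range set.URLs {
		fmt.Println("loc:", u.Loc)
	}
}
```

Filtering on the decoded string keeps the example short. For large sitemaps, wrapping the response body in a filtering `io.Reader` would avoid holding a second copy in memory.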

1 participant