Collecting Web Data - APIs & Web Scraping

Getting Started in Python & R

BEAUTIFUL SOUP

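BeautifulSoup4 (BS4) is the current version of one of the most popular Python modules used in web scraping. It is usually paired with the "Requests" package, which fetches the page; Requests is a third-party package that must be installed separately (for example with pip) and is not part of the Python standard library. BS4 then provides a lot of easy-to-use tools for parsing, navigating, and extracting data from webpages.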

Documentation & Official Guide: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
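
As a rough illustration of the parsing, navigating, and extracting described above, here is a minimal sketch in Python. The HTML string, the "guides" id, and the variable names are all made up for this example; in a real scraper the HTML would normally come from an HTTP response (for instance, the text returned by Requests) rather than a hard-coded string.

```python
from bs4 import BeautifulSoup

# Hypothetical page content used only for illustration.
# In practice this string would be the body of an HTTP response.
html = """
<html>
  <head><title>Example Catalog</title></head>
  <body>
    <h1>Library Guides</h1>
    <ul id="guides">
      <li><a href="/guides/python">Python</a></li>
      <li><a href="/guides/r">R</a></li>
    </ul>
  </body>
</html>
"""

# Parse the HTML with Python's built-in parser (no extra install needed).
soup = BeautifulSoup(html, "html.parser")

# Navigate: read the <title> text and the first <h1> heading.
page_title = soup.title.string
heading = soup.find("h1").get_text(strip=True)

# Extract: collect (link text, href) pairs from the list with id="guides".
guide_list = soup.find("ul", id="guides")
links = [(a.get_text(strip=True), a["href"]) for a in guide_list.find_all("a")]

print(page_title)
print(heading)
print(links)
```

The same calls (find, find_all, get_text) should work on a real page once its HTML has been downloaded, though the tag names and ids to search for will depend entirely on how that page is built.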

RVEST

rvest is part of the Tidyverse collection of R packages. Like BeautifulSoup4 for Python, rvest provides an easy-to-use set of functions and tools for collecting, parsing, navigating, and extracting data from webpages.

Documentation & Official Guide: https://rvest.tidyverse.org/

Guides and Tutorials

Web Scraping Tutorials

You can find a LOT of guides, tutorials, and walkthroughs on the internet for setting up a web scraper on your own computer.  Below are a few examples of guides that may be useful to you.

NOTE: Some of these websites contain links and connections to promotional and paid content. The links to these guides are not an endorsement of the websites' products. Furthermore, in the opinion of the author of this webpage, you DO NOT NEED to pay for anything. If you use your critical thinking and research skills, you can find free guides, tutorials, and walkthroughs all over the internet for just about everything you could ever want to do using Python or R.

Automated / Headless Browsers

SELENIUM WEB DRIVER

Official Documentation & Guide: https://www.selenium.dev/documentation/en/

Python Package: https://selenium-python.readthedocs.io/navigating.html

Additional Packages

In addition to Selenium, there are a variety of other tools and packages available in many different programming languages to assist with headless and automated web browsing.  Below are just a few examples related to Python and R.

Python

Although Selenium is the most popular tool within the Python community for headless and automated browsing, there are alternative tools available.  The following are just two examples:

R

The R community has relatively few pre-built packages for automated web browsing and headless browsers. You may need to do considerable research and testing to use these tools in R. Below are links to just two examples pulled from the web, but you will want to do further research on your own!