Collecting Web Data - APIs & Web Scraping

APIs

Understanding APIs

An API ("application programming interface") is a structured, pre-defined set of functions and operations that a program can call to access data and run procedures in a particular context.  Many companies and organizations with large and complex software and data environments use APIs internally so that employees can access and use data efficiently.  In many cases, an API is strictly for internal use, and people outside the company or organization cannot use it at all.

However, some companies provide access to their API to external users.  Sometimes API access is free (e.g. Reddit), sometimes it costs money.  Additionally, some companies offer a free version of their API with limited functionality as well as a paid version with more robust functionality (e.g. Twitter's "Standard API" vs "Premium API").  Some APIs are relatively "easy" to use in a technical sense, while others are much more challenging.
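
To make "a pre-defined set of operations" more concrete, here is a minimal sketch of what calling a web API often looks like in Python.  The base URL, the parameter names (q, limit), and the helper names build_url and fetch_json are all made up for illustration; a real API documents its own endpoints, parameters, and authentication rules, and those will differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration; not a real service.
BASE_URL = "https://api.example.com/v1/posts"


def build_url(base_url, params):
    """Attach query parameters to an endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url):
    """Send a GET request and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Build the request URL; "q" and "limit" are assumed parameter names.
url = build_url(BASE_URL, {"q": "web scraping", "limit": 10})
print(url)
```

Calling fetch_json(url) against a real endpoint would return the response as ordinary Python dictionaries and lists, which is a large part of why APIs tend to be convenient to work with.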

APIs over Web Scraping!

In brief, if you have a choice between using an API vs. web scraping in order to interact with a website and collect data, ALWAYS START WITH THE API!!!  There are three key reasons for this:

  • Cost - Web scraping is computationally more intensive for both the web host and you, the end user.  Both parties spend more time and resources, which ultimately translates to a higher cost.  By comparison, using an API is much cheaper overall for everyone involved.
  • Sustainability (Environmental) - Setting aside direct monetary costs, using APIs instead of web scrapers also has an impact on the environment.  APIs are a more computationally efficient and energy-efficient way of transferring data, which reduces the amount of power needed on both ends to keep computers running.
  • Sustainability (Human Labor) - Websites are constantly changing their layouts and how information is displayed to users.  When a website updates or refreshes its design, you will likely have to rewrite your web scraping script, sometimes from scratch.  APIs, on the other hand, tend to stay stable for much longer, regardless of how the public-facing website changes.
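
The human labor point can be illustrated with a small sketch.  Everything here is invented for the example: the JSON payload, the two HTML snippets (the same post before and after an imagined redesign), and the names ScoreScraper, score_from_api, and score_from_html.  The scraper is tied to the old page markup, so it stops finding the value once the layout changes, while the API version keeps working as long as the field name stays the same.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: structured JSON with named fields.
API_RESPONSE = '{"title": "Example post", "score": 42}'

# Hypothetical HTML for the same post, before and after a site redesign.
HTML_OLD = '<div class="post"><span class="score">42</span></div>'
HTML_NEW = '<div class="post"><b data-votes="42">42 points</b></div>'


class ScoreScraper(HTMLParser):
    """Collects the number inside <span class="score"> (the old layout)."""

    def __init__(self):
        super().__init__()
        self._in_score = False
        self.score = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "score") in attrs:
            self._in_score = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_score = False

    def handle_data(self, data):
        if self._in_score:
            self.score = int(data)


def score_from_api(payload):
    """Read the score from a JSON payload by field name."""
    return json.loads(payload)["score"]


def score_from_html(html):
    """Scrape the score from HTML; returns None if the markup is not found."""
    scraper = ScoreScraper()
    scraper.feed(html)
    return scraper.score


print(score_from_api(API_RESPONSE))  # unaffected by page design
print(score_from_html(HTML_OLD))     # works on the old layout
print(score_from_html(HTML_NEW))     # finds nothing after the redesign
```

In this sketch the scraper fails silently by returning None, which is a common way for scraping scripts to break: nothing crashes, but the data quietly stops arriving until someone notices and updates the script.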

APIs as Described by Tom Scott