Collecting Web Data - APIs & Web Scraping

Issues to Consider

No clear answers; lots of questions:

Web scraping is an extremely common practice among researchers and web developers.  The best example may be Google Search.  When you use Google to find information, you are (in highly oversimplified terms) not actually searching the "live" internet, but rather a database of webpages that Google has mapped.  If Google is allowed to do it, why can't you?

Before assuming that your project is both legal and ethical, you need to ask yourself a lot of questions regarding the nature, scope, and purpose of your web scraping activities.  If your project collects only public-domain data from publicly available websites, the individual data points do not connect directly to individual people in the real world, and your scraping places minimal technical burden on the website owner, then you are most likely in the clear!

However, in truth, most web scraping research projects are not so clearly defined.  While you will have to make the determination yourself regarding whether your project is both legal and ethical, the questions below are meant to prompt the kind of thinking that may not be immediately obvious the first time you start scraping the internet for data. 

Research Ethics

  • Is the data you are collecting potentially sensitive information? 
  • If you are scraping user-comments from a social media website, are the users aware that their comments are visible to you or others?
  • Are the users fully or partially aware of how their comments and data may be used?
  • Do the users have an expectation of anonymity or confidentiality?
  • Do the users represent a marginalized or at-risk group?
  • Does your research pose any form of potential risk to the users who supplied the data you are using?

Public vs. Protected Content

  • Are the webpages you are collecting data from freely and publicly visible? OR...
  • Are the webpages you are collecting data from password-protected, requiring you to log into the website?
  • If webpages and content are password-protected, does the website require you to adhere to a "Terms of Service", "Terms of Use" or other type of agreement in order to access and use the website?
    • Often these agreements explicitly forbid systematic web-scraping activities

Copyright & Commercial Activity

  • Are you violating copyright as part of your web-scraping activities?
  • Are you reproducing the data or contents of webpages on your own website or in another medium?
  • If you are reproducing or embedding the content in some way, do you have the site owner's permission?

Sustainability

  • Are you systematically collecting large volumes of webpages at a high rate from the target website?
  • Are you systematically collecting on a repeating schedule at a rapid rate?
  • Are you collecting webpages in a way that poses commercial and/or technical risk to the operations of the target website?
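
One way to keep the technical burden on a target website low is to space out requests and cap how many pages are collected per run. The following is a minimal sketch of that idea in Python, not a definitive implementation. The function name `fetch_politely`, its parameters, and the default values (a 5-second pause and a 100-page cap) are assumptions chosen for illustration; appropriate limits depend on the website and on any terms of use that apply.

```python
import time
from typing import Callable, Iterable, List, Optional


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay_seconds: float = 5.0,   # assumed default; tune per website
    max_pages: int = 100,             # assumed default; hard cap per run
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Fetch pages one at a time, pausing between requests and capping the total."""
    pages: List[str] = []
    last_request_at: Optional[float] = None
    for url in urls:
        if len(pages) >= max_pages:
            break  # stop once the per-run volume cap is reached
        if last_request_at is not None:
            # Wait until at least min_delay_seconds have passed since the last request.
            elapsed = clock() - last_request_at
            if elapsed < min_delay_seconds:
                sleep(min_delay_seconds - elapsed)
        last_request_at = clock()
        pages.append(fetch(url))
    return pages
```

The `fetch` argument is whatever function actually downloads a page; for example, it could wrap `urllib.request.urlopen` from the Python standard library. Passing it in as a parameter keeps the pacing logic separate from the download logic, so the delay and page cap can be adjusted without touching the rest of the scraper.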

Resources

Sometimes it is not clear whether an individual web scraping project is fully legal or ethical.  For GSU researchers, there are two key resources available to help make this determination.  For legal questions and guidance, GSU's Legal Services office is the best point of contact.  For issues and questions concerning research ethics, especially when research involves collecting online data about people (e.g., from social media), GSU's IRB office is the best point of contact.