Web scraping and collection of public web data are playing an increasingly important role in private sector decision-making processes.
Today, the alternative data industry is worth nearly $7 billion. Some experts argue that web scraping has yet to reach its true potential, but a recent Oxylabs survey found that over 52% of UK financial firms collect data using automated processes. Most (63%) of the survey participants use alternative data to gain competitive business insights.
Despite the positive use of non-traditional data sources in business, the public sector and academia still lag behind. Legal opacity and complex public procurement procedures may be the main constraints on the public sector, but academia enjoys far more freedom. So why do so many students and researchers on college campuses have only a vague understanding of web scraping possibilities and tools?
Web scraping for science
Analyzing big data from alternative sources can help test and validate existing hypotheses and generate new hypotheses. It provides a much broader and sometimes less biased perspective than traditional data sources. However, if you try to search for information related to web scraping for science, you’ll quickly discover that it’s mostly about data scientists and rarely spills over into other fields.
Despite the lack of awareness, the potential for alternative web data analysis in social, economic, and psychological research is vast. For example, the Bank of Japan has actively adopted alternative data to inform monetary policy. It evaluates economic activity using mobility data, such as recreational and retail trends based on nighttime population and credit card usage in specific areas of Tokyo.
During the COVID-19 pandemic, virology and psychology studies also gained valuable insights from alternative web data. Localized Google Search Trends were more accurate at predicting trends than other means, while scraped public Twitter data was used to understand the attitudes and experiences of the general public towards remote work. Other notable examples of the use of alternative data in scientific research include studies of depression and personality based on public social media activity and studies of weight bias in reader comments below articles on obesity.
The benefits of web scraping can be easily observed in marketing and e-commerce research. Scientists can automate price collection for specific commodities (electronics, housing, food, etc.) to calculate the consumer price index. Marketing researchers can track the same product sold under different terms (such as discounted prices) to estimate how such factors affect consumer behavior, which is not always rational.
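As a rough illustration of the price-index idea, the sketch below computes an unweighted Jevons index (the geometric mean of price ratios) from two sets of collected prices. The product names and prices are made up for the example, and the choice of the Jevons formula is an assumption; official consumer price indices use weighting schemes that are not shown here.

```python
import math


def jevons_index(base_prices, current_prices):
    """Unweighted geometric mean of price ratios, scaled so the base period is 100."""
    # Only products observed in both periods can be compared.
    common = sorted(set(base_prices) & set(current_prices))
    if not common:
        raise ValueError("no products observed in both periods")
    log_sum = sum(math.log(current_prices[p] / base_prices[p]) for p in common)
    return 100.0 * math.exp(log_sum / len(common))


# Hypothetical scraped prices for two collection dates.
base = {"laptop": 1000.0, "phone": 500.0, "bread": 2.0}
current = {"laptop": 1100.0, "phone": 500.0, "bread": 2.2, "tablet": 300.0}

index = jevons_index(base, current)
print(f"Price index relative to base period: {index:.2f}")
```

The tablet is ignored because it has no base-period price; a real study would need an explicit rule for products that appear or disappear between collection dates.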
Finally, web scraping of public data is essential for artificial intelligence (AI) and machine learning (ML) research. AI and ML research is becoming very popular, and almost all large universities offer AI and ML related study programs. A common challenge faced by students in these programs is the lack of suitable datasets on which to train AI/ML algorithms. Knowing how to scrape public data helps AI and ML students build high-quality datasets for more efficient machine learning.
Investigative reporting
One area where public web data collection is indispensable is investigative journalism and political research. This kind of research relies critically on unbiased, niche data whose full complexity is rarely available in traditional data sources.
Investigative journalists and political scientists have used scrapers to study a wide range of issues, from tracking lobbyist influence by examining visitor logs from government buildings to monitoring banned political ads and extremist or sectarian groups on public social media platforms and forums. It can be argued that web scraping is important for solving social problems, and thus for the functioning of democracies themselves and the rule of law.
Awareness gap
Web scraping is not a panacea for all scientific problems. It is of little use for physical and life science experiments, but it can unlock a holy grail of data for social, economic, political, and possibly clinical research. Automating big data collection is a long-awaited breakthrough for many scientists. However, it suffers from some misconceptions.
In the social sciences, scholars sometimes turn to experimental and survey data simply because this kind of evidence is easier to collect than web data. Without formal training in web scraping, students who try to find the information they need online usually resort to manual data entry (the trusty copy-and-paste technique), which is time-consuming and error-prone.
Common sources of academic research data are large databases owned by public and government agencies, and datasets provided by companies. Unfortunately, the simplicity of this method comes at a price. Government data takes time to collect, can get outdated quickly, and the same data points are (over)analyzed by thousands of scientists, rarely yielding fresh insights. Data provided by private bodies may be biased. If the information is highly sensitive, companies may demand to see the final results of the study, often resulting in so-called outcome reporting bias.
The myriad sources of free alternative data on the web open up possibilities for conducting original research that would otherwise not be possible. It’s like having an infinite dataset that can be updated with almost any information. Web scraping certainly requires specific knowledge, but today’s data collection solutions allow users to extract large amounts of alternative data with just basic programming skills. The ability to return data in real time makes scientific predictions more accurate, as opposed to traditional data collection methods which often introduce significant time lags.
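To give a sense of what "basic programming skills" can mean in practice, here is a minimal sketch that pulls prices out of a small HTML fragment using only Python's standard library. The HTML snippet, the `price` class name, and the `PriceParser` class are hypothetical and chosen for illustration; real pages differ in structure and usually call for more robust parsing.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the numeric value of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            self.prices.append(float(text.lstrip("$")))


# Hypothetical page fragment standing in for a downloaded product listing.
SAMPLE_HTML = """
<ul>
  <li>Laptop <span class="price">$999.00</span></li>
  <li>Phone <span class="price">$499.50</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(SAMPLE_HTML)
prices = parser.prices
print(prices)
```

Compared with copying values by hand, the same few lines can be rerun on every page and every collection date without transcription errors.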
It’s important to note that academics rarely have a good reason (in terms of either time or resources) to build their own data scrapers and parsers from scratch. Third-party vendors can handle proxy management, CAPTCHA solving, and building unique fingerprints and parsing pipelines, so scientists can spend more time analyzing the data and doing research.
Fear of Legal Uncertainty
There are various legal concerns about web scraping that discourage some researchers from leveraging public big data in their research. Since the industry is relatively new and has a wide range of players, there has been some genuinely unprofessional activity. However, any digital tool can be used for both positive and negative purposes.
There is nothing inherently unethical about web scraping, as it simply automates activities that humans would otherwise do manually. We all know the most famous web scraper, Googlebot, and rely on it every day. Web scraping is also widely used in e-commerce. For example, a large flight comparison website scrapes thousands of airline sites to gather public price data. It is public web data gathering technology that makes it possible to find the best deal on a trip to New York.
Because web scraping carries a certain amount of risk, scholars often choose to abandon it altogether and go back to traditional data sources, or scrape here and there and hope no one notices. The best way out of this ambiguity is to consult legal counsel before embarking on any large-scale data collection project. Answering the following questions may also help researchers assess possible risks:
- Is the public data compiled from human subjects? If yes, could it be subject to privacy laws (e.g. GDPR)?
- Does the website offer an API?
- Is web crawling or scraping prohibited by the website’s terms of service?
- Is the website data expressly copyrighted or subject to intellectual property rights?
- Is the website's data paywalled (does it require a subscription)?
- Is the required data locked behind a login?
- Does your project involve illegal or unauthorized use of data?
- Have you read the robots.txt file and adjusted the scraper accordingly?
- Can crawling or scraping seriously damage the website or the server hosting it?
- Can scraping or crawling significantly affect the quality of service (such as speed) of the target website?
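The robots.txt and quality-of-service questions in the checklist can be partly answered in code. The sketch below uses Python's standard `urllib.robotparser` module to check whether a URL may be fetched and what crawl delay the site requests. The robots.txt content, the `research-bot` user agent, and the example.com URLs are assumptions made for illustration; a real scraper would download the target site's actual robots.txt, and this check does not replace legal advice or a review of the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice this is fetched from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "research-bot"  # assumed name for the researcher's scraper


def build_policy(robots_text):
    """Parse robots.txt text into a reusable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, user_agent=USER_AGENT):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


policy = build_policy(ROBOTS_TXT)

# Fall back to a conservative one-second pause if no Crawl-delay is declared.
delay = policy.crawl_delay(USER_AGENT) or 1

for url in ("https://example.com/articles/1", "https://example.com/private/data"):
    status = "allowed" if allowed(policy, url) else "disallowed"
    print(f"{url}: {status} (wait {delay}s between requests)")
```

Pausing for the declared delay between requests (for example with `time.sleep(delay)`) is one simple way to avoid degrading the target website's speed.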
To promote ethical data collection practices and industry-wide standards, Oxylabs collaborated with other prominent DaaS companies to establish the Ethical Web Data Collection Initiative. The consortium aims to build trust in web scraping and educate the broader technology community about the possibilities of big data.
Project 4β for free web data
The awareness gap around web scraping is probably the single main reason why academia isn’t taking advantage of this technology. To fill this gap and enable academics to collect big data using web scraping tools, Oxylabs has launched a free initiative called “Project 4β”. The initiative aims to share Oxylabs’ years of accumulated technical expertise and grant universities and NGOs free access to data scraping tools to support critical research on big data. Project 4β is also a safe space for academics to discuss what conduct is appropriate and ethical according to the legal precedents that have formed over the past two decades.
Through “Project 4β,” Oxylabs has already partnered with professors and students at the University of Michigan, Northwestern University, and CODE (University of Applied Sciences) to share its knowledge on the challenges of ethical web scraping. Some of the educational resources offered are now integrated into postgraduate courses.
Over the last few years, Oxylabs has also been active in pushing the forefront of web scraping technology through AI and ML powered solutions. To facilitate know-how sharing, the company has established an AI and ML Advisory Board that includes five prominent academic and industry leaders. More active collaboration with academics will open up a wider range of web scraping possibilities to address important social issues.
Wrapping up
Web scraping has yet to catch the attention of the public and academia. However, with the amount of web data growing exponentially every year, big data analytics will gradually become an inevitable part of scientific research. Familiarizing students with the practice of web scraping should become the norm, just as it is now routine to teach SPSS fundamentals on social science campuses.
Sure, web scraping comes with certain risks and ethical considerations, but so does scientific experimentation in a lab. Although organizations should always consult legal counsel before scraping, there are industry best practices that minimize most of the risks associated with collecting public web data.
About the author

Juras Juršėnas is the COO of Oxylabs.io. Oxylabs’ mission is to ensure that all companies, large and small, have access to publicly available big data. We believe public data collection is critical to the success of any company. We treat our clients as partners and ensure that both parties get the maximum benefit from this interaction. Our clients choose us because we provide the highest-quality proxies to help them with market research, ad verification, brand protection, travel price aggregation, SEO monitoring, price intelligence and more.
Featured image: ©A_B_C
