Josselin Liebe • November 12, 2024 • 8 min read • Scraping
Approximately 20% of the websites you need to scrape use Cloudflare, a robust anti-bot protection system that can easily block you. Indeed is among the sites protected by Cloudflare's anti-bot system, featuring its well-known "Verify you are human" or "Additional Verification Required" challenge. In this article, we explore possible ways to get past these anti-bot measures and successfully scrape Indeed's job pages and company pages.
Indeed is structured into several sections:
Data scraping, or "web scraping," refers to the automated extraction of data from a website through software or scripts. This process enables companies to collect large amounts of information quickly, which may include job listings, company details, and even user profiles. For example, data scraped from Indeed can be valuable for analytics, recruitment, and competitive research but must adhere to strict legal frameworks.
Data scraping allows businesses and researchers to gather valuable information from Indeed’s platform, such as job trends, salaries, skill requirements, and employer data. This data helps in making informed decisions, driving market research, and creating innovative services.
Companies use scraped data to track industry trends, analyze competitors, enhance recruitment strategies, and create job market insights. This data empowers organizations to optimize hiring practices, build data-driven products, and understand market demands more effectively. Scraping data from Indeed requires a strategic approach due to the platform’s structure and the protections it has in place, such as Cloudflare’s anti-bot measures. Understanding how Indeed is organized and how to bypass these security protocols will help you collect the data you need efficiently.
Cloudflare solvers such as Cloudscraper, FlareSolverr, or Cfscrape can be advantageous. These tools emulate human browsing behavior and help the web scraper get past CAPTCHA challenges and other bot detection mechanisms.
Use a library such as Beautiful Soup, Scrapy, Cheerio, or lxml to develop your web scraper. These libraries help you navigate Indeed's webpages, parse the HTML, and extract the desired data, text, or images.
Cloudflare provides content delivery and web security services, including its Web Application Firewall (WAF), which safeguards websites against threats like cross-site scripting (XSS), credential stuffing, and Distributed Denial of Service (DDoS) attacks.
A vital component of Cloudflare is the Bot Manager, designed to protect websites from malicious bot traffic. The Bot Manager identifies and mitigates bot attacks without disrupting legitimate users. However, Cloudflare considers any unknown or non-whitelisted bot traffic, such as web scrapers, to be malicious. Therefore, even legitimate scraping attempts may be blocked, leading to the denial of access to Cloudflare-protected websites.
These errors are often accompanied by a Cloudflare 403 Forbidden HTTP response status code, indicating that the request was blocked due to suspected bot activity. To bypass these protections, specific Cloudflare solvers or techniques such as rotating proxies, mimicking human behavior, or using headless browsers may be required.
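Before reaching for proxies or a headless browser, it can help to confirm that a response really is a Cloudflare block and not an ordinary error. The sketch below is one heuristic way to check: `looks_like_cloudflare_block` and `CHALLENGE_MARKERS` are hypothetical names, and the marker strings are assumptions based on the challenge texts quoted earlier. The `cf-mitigated: challenge` header is the signal Cloudflare documents for challenge responses.

```python
from typing import Mapping

# Assumed marker strings; Cloudflare may change the wording of its challenge pages.
CHALLENGE_MARKERS = (
    "Just a moment...",
    "Verify you are human",
    "Additional Verification Required",
)


def looks_like_cloudflare_block(status_code: int, headers: Mapping[str, str], body: str) -> bool:
    """Heuristically decide whether a response is a Cloudflare block or challenge."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    normalized = {key.lower(): value.lower() for key, value in headers.items()}

    # Cloudflare sets this header on challenge pages.
    if normalized.get("cf-mitigated") == "challenge":
        return True

    served_by_cloudflare = "cloudflare" in normalized.get("server", "")
    lowered_body = body.lower()
    has_marker = any(marker.lower() in lowered_body for marker in CHALLENGE_MARKERS)

    # A 403/503 only counts when something else also points at Cloudflare.
    return status_code in (403, 503) and (served_by_cloudflare or has_marker)
```

A scraper could call this on every response and switch strategy (new proxy, longer wait, different tool) only when it returns `True`.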
The following code snippet shows an example of an HTTP request and parsing method intended to extract job data from Indeed using the Python libraries httpx and re:
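A minimal sketch of such a request and parser, under stated assumptions: the search URL, the `q` and `l` query parameters, and the `jobTitle` markup pattern are assumptions about Indeed's pages, and `parse_job_titles` and `fetch_job_titles` are hypothetical helper names.

```python
import html
import re

# Assumed search URL; regional Indeed sites use other hostnames.
SEARCH_URL = "https://www.indeed.com/jobs"

# Hypothetical pattern: assumes each title sits in a <span title="..."> inside
# an <h2> whose class contains "jobTitle". The real markup changes often.
TITLE_RE = re.compile(
    r'<h2[^>]*class="[^"]*jobTitle[^"]*"[^>]*>.*?<span[^>]*title="([^"]+)"',
    re.DOTALL,
)


def parse_job_titles(page_html: str) -> list[str]:
    """Extract job titles from a search results page using the regex above."""
    return [html.unescape(title) for title in TITLE_RE.findall(page_html)]


def fetch_job_titles(query: str, location: str) -> list[str]:
    """Request the search page with httpx and parse the job titles from it."""
    import httpx  # imported here so the parser works without the dependency

    response = httpx.get(
        SEARCH_URL,
        params={"q": query, "l": location},
        headers={"User-Agent": "Mozilla/5.0"},
        follow_redirects=True,
        timeout=30.0,
    )
    # A Cloudflare block typically surfaces here as a 403 error.
    response.raise_for_status()
    return parse_job_titles(response.text)
```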
This request fails, as Indeed’s website employs anti-bot protections, notably through Cloudflare, which blocks HTTP requests that don’t simulate human behavior. Libraries like httpx or requests are generally ineffective against these protections. To bypass Cloudflare, you need tools such as headless browsers or dedicated web scraper APIs that can mimic human interactions more reliably.
While the techniques mentioned in this article can be helpful, they cannot guarantee success at all times due to Cloudflare frequently updating its security measures. The most reliable way to deal with Cloudflare is to use a web scraping API, like Piloterr. It handles all of Cloudflare's detection methods behind the scenes, allowing you to focus on your scraping logic without worrying about bypassing bot protection.
Piloterr works with all programming languages. You only need a single API call to bypass Cloudflare and retrieve the data you need.
To see how Piloterr works, let’s use it to access Indeed Jobs, a website heavily protected by Cloudflare.
Python code:
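A minimal sketch of such a call, using only the standard library. Several details here are assumptions to verify against Piloterr's documentation: the endpoint path in `PILOTERR_ENDPOINT`, the `query` parameter name, and the `x-api-key` header. The `de.indeed.com` search URL with `q` and `l` parameters is also an assumption, and the helper names are hypothetical. Only `wait_in_seconds` and its 5 to 20 second range come from this article.

```python
import urllib.parse
import urllib.request

# Assumption: check the exact endpoint path in Piloterr's documentation.
PILOTERR_ENDPOINT = "https://piloterr.com/api/v2/website/rendering"


def build_indeed_search_url(keyword: str, location: str,
                            base: str = "https://de.indeed.com/jobs") -> str:
    """Build an Indeed search URL (assumed `q` and `l` query parameters)."""
    return f"{base}?{urllib.parse.urlencode({'q': keyword, 'l': location})}"


def build_piloterr_request(target_url: str, api_key: str,
                           wait_in_seconds: int = 5) -> urllib.request.Request:
    """Prepare the API request; `query` and `x-api-key` are assumed names."""
    if not 5 <= wait_in_seconds <= 20:
        raise ValueError("wait_in_seconds should be between 5 and 20")
    params = urllib.parse.urlencode(
        {"query": target_url, "wait_in_seconds": wait_in_seconds}
    )
    return urllib.request.Request(
        f"{PILOTERR_ENDPOINT}?{params}", headers={"x-api-key": api_key}
    )


def fetch_rendered_page(target_url: str, api_key: str, wait_in_seconds: int = 5) -> str:
    """Send the request and return the response body as text."""
    request = build_piloterr_request(target_url, api_key, wait_in_seconds)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Here the target would be `build_indeed_search_url("Senior Java Developer", "Berlin")`, passed to `fetch_rendered_page` together with your API key.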
With this request, you can get all the jobs that have the keyword “Senior Java Developer” in the location “Berlin”.
Check out the documentation to see how to configure the scraping request. Simply paste the target URL and set wait_in_seconds to a value between 5 and 20 seconds. You can then use a simple HTTP request to search for jobs (and bypass the Cloudflare anti-bot protection) and scrape URLs and text without any headaches.
If you're interested in scraping company data on Indeed, Piloterr offers a dedicated web scraping API to make the process straightforward and efficient. By using this API, you can bypass Cloudflare’s protection seamlessly and obtain structured JSON data about companies on Indeed.
To retrieve company information for a specific company on Indeed, follow these steps:
Python code:
Response:
By using this endpoint, you save time, as the response is already structured in JSON, allowing for smooth integration with your scraping logic without needing to parse raw HTML.
Note: This API endpoint focuses on company information, so the response does not contain URLs or jobs. Some fields in the JSON response may be null if the information is not available or if Indeed has restricted access to certain data. Ensure your Python code handles these cases to avoid potential errors in data processing.
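One way to handle null fields is to normalize the response once, right after parsing, so the rest of the code never touches raw values. This is a minimal sketch: the field names (`name`, `industry`, `rating`, `headquarters`) are hypothetical placeholders, not the endpoint's documented schema, and `normalize_company` is an illustrative helper.

```python
import json
from typing import Any, Optional


def to_float(value: Any) -> Optional[float]:
    """Convert a value to float, returning None for null or unparsable input."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def normalize_company(raw_json: str) -> dict:
    """Parse a company payload and replace missing or null fields with safe defaults."""
    payload = json.loads(raw_json)
    return {
        # Hypothetical field names; adapt them to the actual response.
        "name": payload.get("name") or "Unknown",
        "industry": payload.get("industry") or "Unknown",
        "rating": to_float(payload.get("rating")),
        "headquarters": payload.get("headquarters"),  # may legitimately stay None
    }
```

Text fields fall back to a readable default, while numeric fields stay `None` so that a missing rating is never mistaken for a rating of zero.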
Refer to Piloterr’s documentation for additional options to optimize your requests, such as specifying the wait time in seconds, search parameters, or adjusting user-agent headers to improve response quality.
With Piloterr, you can also scrape job listings directly from company profiles on Indeed, such as this URL: indeed.com/cmp/Google/jobs. Indeed Job Scraper allows you to extract valuable job data, including job title, description text, company name, location, salary, ratings, employment type, and more.
Here are some valuable use cases:
1. Salary Analysis & Benchmarking / using the salary data from job listings, you can:
For example, from the data we can see Microsoft's Software Engineer salaries range significantly based on location and experience level.
2. Job Market Intelligence / the data provides insights into:
3. Career Path Planning / the structured job title data can be used to:
4. Company Culture Analysis / using the review and rating data:
5. Interview Preparation / the interview data provides:
6. Competitive Intelligence / companies can:
This data can be particularly valuable for HR professionals, job seekers, and business analysts looking to make data-driven decisions about employment and workforce trends.
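As an illustration of the salary benchmarking use case, the sketch below groups yearly salaries by location and computes a median per location. It assumes each scraped job is a dict with `location` and `salary` keys and that salaries are free-text strings in dollars such as "$120,000 - $150,000 a year"; both are assumptions about the scraped data, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from statistics import median
from typing import Optional

# Matches dollar amounts such as "$120,000" or "$95,000.50".
AMOUNT_RE = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")


def salary_midpoint(text: Optional[str]) -> Optional[float]:
    """Return the midpoint of a yearly salary range, or None if it cannot be parsed."""
    if not text or "year" not in text.lower():
        return None  # skip missing values and hourly or monthly rates
    amounts = [float(raw.replace(",", "")) for raw in AMOUNT_RE.findall(text)]
    if not amounts:
        return None
    return (min(amounts) + max(amounts)) / 2


def median_salary_by_location(jobs: list[dict]) -> dict[str, float]:
    """Group parsable yearly salaries by location and compute each median."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for job in jobs:
        midpoint = salary_midpoint(job.get("salary"))
        if midpoint is not None:
            buckets[job.get("location") or "Unknown"].append(midpoint)
    return {location: median(values) for location, values in buckets.items()}
```

Listings without a parsable yearly salary are ignored, so a location appears in the result only when at least one usable figure exists for it.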
While Google no longer offers access to cached pages, you can still view archived versions of many websites through services like WebCite and the Internet Archive. These sites provide snapshots of web pages, allowing you to access content from protected sites without directly visiting their domain or passing through Cloudflare’s CDN.
To use archives when other methods fail, here are a few steps to follow:
If these conditions are met, explore the archive of the target site to see if a cached version is accessible.
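For the Internet Archive, checking whether a snapshot exists can be automated with the Wayback Machine availability API. The sketch below queries that API and returns the URL of the closest snapshot, if any; `closest_snapshot_url` and `find_snapshot` are hypothetical helper names, and the target page is whatever URL you pass in.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

AVAILABILITY_API = "https://archive.org/wayback/available"


def closest_snapshot_url(api_response: dict) -> Optional[str]:
    """Pull the closest snapshot URL out of an availability API response."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(page_url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Ask the Wayback Machine for the snapshot closest to `timestamp` (YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    request_url = f"{AVAILABILITY_API}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(request_url, timeout=30) as response:
        return closest_snapshot_url(json.load(response))
```

When `find_snapshot` returns `None`, no archived copy is available and another method is needed.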
If you need to automate the process of retrieving job title suggestions related to "developer," you can use a simple script to interact with the Indeed endpoint for autocomplete suggestions. This can help you gather a list of relevant titles that are frequently associated with developer roles, providing insights into similar or related positions.
You can use the following Python script to scrape and parse the text, extracting only the relevant job title suggestions:
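A minimal sketch of such a script, using only the standard library. The endpoint URL, its query parameters, and the shape of the JSON response (a list of strings, or of objects with a `suggestion` key) are all assumptions to check against the live endpoint; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed autocomplete endpoint; verify it in your browser's network tab.
SUGGEST_ENDPOINT = "https://autocomplete.indeed.com/api/v0/suggestions/what"


def extract_titles(payload: list) -> list[str]:
    """Keep only the suggestion text, tolerating string or object items."""
    titles = []
    for item in payload:
        if isinstance(item, dict):
            item = item.get("suggestion")  # assumed key name
        if isinstance(item, str) and item.strip():
            titles.append(item.strip())
    return titles


def fetch_title_suggestions(query: str, country: str = "US",
                            language: str = "en", count: int = 10) -> list[str]:
    """Request autocomplete suggestions for `query` and return the job titles."""
    params = urllib.parse.urlencode(
        {"country": country, "language": language, "count": count, "query": query}
    )
    request = urllib.request.Request(
        f"{SUGGEST_ENDPOINT}?{params}", headers={"User-Agent": "Mozilla/5.0"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_titles(json.load(response))
```

For example, `print(fetch_title_suggestions("developer"))` prints the suggestions returned for that query.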
This script sends an HTTP request to the Indeed API and prints a list of suggested job titles related to "developer." This Indeed endpoint is not currently protected by Cloudflare, but it may become so.
Note: It is also possible to use the same approach to retrieve location suggestions from Indeed, providing a list of relevant cities. This can be particularly useful when developing a web application to help prevent null results for the client by populating search fields with valid options.
The legality of data scraping is governed by intellectual property and data protection laws. The Code of Intellectual Property regulates data extraction in terms of usage, quantity, and intent. Here's a summary of what is generally permitted:
Indeed's Terms of Service explicitly prohibit scraping activities for commercial use without authorization. They restrict the use of “bots, scripts, or APIs” to scrape data from their website, especially when the data is used for competitive purposes, profiling, or mass data collection.
Example clause: "You agree not to use any robot, spider, scraper, or other automated means to access the Indeed site for any purpose without Indeed's express written permission."
Violating these terms could result in legal action and hefty fines. Indeed retains the right to seek compensation for damages caused by unauthorized scraping, which can amount to significant financial and reputational losses for the offending company.
As of June 2023, Indeed offers a range of APIs for developers free of charge. However, these APIs are primarily intended for the hiring side of the platform. They are useful for integrating Indeed with applicant tracking systems, tracking applicant conversions, or scheduling interviews, but they are not designed for job search purposes.
Previously, the Publisher Jobs API (including the Get Job and Job Search functions) was available specifically for job searches, allowing users to gather data such as job titles, company names, description text, locations and posting times. Since these APIs were deprecated, users have turned to alternatives, like an Indeed scraper, to access similar job search data.
In conclusion, scraping data from Indeed gives access to a wealth of valuable information, including jobs, companies, locations, and other useful details. Through the methods outlined, including scraping APIs like Piloterr, it’s possible to extract text data from a simple URL while bypassing protections like Cloudflare. This approach provides businesses with critical insights to enhance recruitment strategies, competitive analysis, and market trend studies. However, it's crucial to comply with Indeed's terms of service to ensure lawful use of this data.