Scraping salary data helps businesses stay competitive by understanding compensation benchmarks across industries. It also empowers job seekers to make informed career decisions based on real-world salary insights. Comparably is a valuable resource for salary transparency, providing insights into compensation across different companies and departments. However, extracting this data programmatically can be challenging due to dynamic content loading and anti-bot measures.
In this tutorial, we'll show you how to scrape Comparably's salary data using Python and Piloterr's powerful Website Rendering API.
Why use Piloterr for Comparably scraping?
Comparably uses Angular and loads content dynamically, making traditional scraping methods ineffective. Piloterr's browser rendering API solves this by:
- Rendering JavaScript: fully executes the Angular application
- Bypassing protection: handles Cloudflare and other anti-bot measures
- Browser instructions: allows scrolling to trigger lazy-loaded content
- Wait conditions: ensures content is fully loaded before extraction (see the minimal request sketch below)
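These capabilities map directly onto fields of the Website Rendering endpoint. Here is a minimal sketch of such a request, built from the same endpoint and parameters used later in this tutorial (the company slug is only an example):
import requests
# Minimal rendering request: wait_for waits for Angular to initialize,
# browser_instructions scrolls so lazy-loaded sections (like salaries) render
payload = {
    "query": "https://www.comparably.com/companies/airbus",  # example company page
    "wait_for": "#ng-state",
    "browser_instructions": [
        {"type": "scroll_to_bottom", "duration": 15, "wait_time_s": 2}
    ]
}
response = requests.post(
    "https://piloterr.com/api/v2/website/rendering",
    headers={"x-api-key": "your-piloterr-api-key", "Content-Type": "application/json"},
    json=payload,
)
print(response.status_code, len(response.text))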
Prerequisites
Before starting, you'll need:
pip install requests beautifulsoup4 lxml
And a Piloterr API key - sign up at Piloterr.com
Step 1: Fetch the rendered HTML
First, let's use Piloterr to get the fully rendered HTML of a company page:
import requests
import json
import re
from bs4 import BeautifulSoup
def fetch_comparably_data(company_name, api_key):
"""
Fetch rendered HTML from Comparably company page using Piloterr
"""
url = "https://piloterr.com/api/v2/website/rendering"
headers = {
"x-api-key": api_key,
"Content-Type": "application/json"
}
# Browser instructions to ensure salary section loads
browser_instructions = [
{
"type": "scroll_to_bottom",
"duration": 15,
"wait_time_s": 2
}
]
payload = {
"query": f"https://www.comparably.com/companies/{company_name}",
"wait_for": "#ng-state", # Wait for Angular to initialize
"browser_instructions": browser_instructions
}
response = requests.post(url, headers=headers, json=payload)
if response.status_code == 200:
return response.text
else:
raise Exception(f"Failed to fetch data: {response.status_code} - {response.text}")
This tells the browser to scroll_to_bottom of the page over a duration of 15 seconds, with a 2-second pause afterward. It ensures that all dynamic content - especially sections loaded on scroll, like salary data on Comparably - is fully rendered before the page is captured.
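To sanity-check this step, you can call the function and save the rendered HTML locally before writing any parsing logic (the "airbus" slug matches the usage example later in this tutorial; the output filename is arbitrary):
# Quick check: fetch the rendered page and keep a local copy for inspection
html = fetch_comparably_data("airbus", "your-piloterr-api-key")
with open("airbus_comparably.html", "w", encoding="utf-8") as f:
    f.write(html)
print(f"Saved {len(html)} characters of rendered HTML")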
Step 2: Complete scraping script
Here's the complete script that ties everything together:
import requests
import json
import re
from bs4 import BeautifulSoup
class ComparablyScraper:
def __init__(self, api_key):
self.api_key = api_key
self.base_url = "https://piloterr.com/api/v2/website/rendering"
def scrape_company_salaries(self, company_name):
"""
Complete scraping workflow for Comparably salary data
"""
try:
# Step 1: Fetch rendered HTML
print(f"Fetching data for {company_name}...")
html_content = self.fetch_comparably_data(company_name)
# Step 2: Parse salary data
print("Parsing salary information...")
salary_data = self.parse_salary_section(html_content)
if salary_data:
return salary_data
else:
print("No salary section found")
return None
except Exception as e:
print(f"Error scraping {company_name}: {str(e)}")
return None
def fetch_comparably_data(self, company_name):
headers = {
"x-api-key": self.api_key,
"Content-Type": "application/json"
}
browser_instructions = [
{
"type": "scroll_to_bottom",
"duration": 15,
"wait_time_s": 2
}
]
payload = {
"query": f"https://www.comparably.com/companies/{company_name}",
"wait_for": "#ng-state",
"browser_instructions": browser_instructions
}
response = requests.post(self.base_url, headers=headers, json=payload)
if response.status_code == 200:
return response.text
else:
raise Exception(f"API request failed: {response.status_code} - {response.text}")
def parse_salary_section(self, html_content):
soup = BeautifulSoup(html_content, 'html.parser')
# Find salary section with multiple fallback strategies
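# The quoted variants below (e.g. id='"salaries"', class_='"section-title"') are fallbacks
# for markup where attribute values keep literal quote characters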
salary_section = soup.find('section', id='salaries')
if not salary_section:
salary_section = soup.find('section', id='"salaries"')
if not salary_section:
salary_section = soup.select_one('section[id*="salaries"]')
if not salary_section:
salary_section = soup.find('section', class_=lambda c: c and 'company-salaries' in c)
if not salary_section:
return None
# Check if there's an inner section with the actual content
inner_section = (salary_section.find('section', class_=lambda c: c and 'section' in c) or
salary_section.find('section', class_=lambda c: c and 'company-salaries' in c))
if inner_section:
salary_section = inner_section
salary_data = {}
# Extract title and company name with multiple CSS class variants
title_element = (salary_section.find('h2', class_='section-title') or
salary_section.find('h2', class_='"section-title"') or
salary_section.find('h2', class_='\\"section-title\\"') or
salary_section.find('h2', class_=lambda c: c and 'section-title' in c))
if title_element:
title_text = title_element.get_text(strip=True)
salary_data['title'] = title_text
salary_data['company'] = title_text.replace(' Salaries', '').strip()
# Extract subtitle
subtitle_element = (salary_section.find('h3', class_='section-subtitle') or
salary_section.find('h3', class_='"section-subtitle"') or
salary_section.find('h3', class_='\\"section-subtitle\\"') or
salary_section.find('h3', class_=lambda c: c and 'section-subtitle' in c))
if subtitle_element:
salary_data['subtitle'] = subtitle_element.get_text(strip=True)
# Extract compensation data with improved regex matching
avg_compensation = (salary_section.find('div', class_='average-compensation') or
salary_section.find('div', class_='"average-compensation"') or
salary_section.find('div', class_='\\"average-compensation\\"') or
salary_section.find('div', class_=lambda c: c and 'average-compensation' in c))
if avg_compensation:
avg_text = avg_compensation.get_text(strip=True)
amount_match = re.search(r'\$[\d,]+\.?\d*\*?', avg_text)
salary_data['average_compensation'] = amount_match.group().replace('*', '') if amount_match else avg_text.split()[0]
median_compensation = (salary_section.find('div', class_='median-compensation') or
salary_section.find('div', class_='"median-compensation"') or
salary_section.find('div', class_='\\"median-compensation\\"') or
salary_section.find('div', class_=lambda c: c and 'median-compensation' in c))
if median_compensation:
median_text = median_compensation.get_text(strip=True)
amount_match = re.search(r'\$[\d,]+\.?\d*\*?', median_text)
salary_data['median_compensation'] = amount_match.group().replace('*', '') if amount_match else median_text.split()[0]
# Extract descriptions from section-text divs
section_texts = (salary_section.find_all('div', class_='section-text') or
salary_section.find_all('div', class_='"section-text"') or
salary_section.find_all('div', class_='\\"section-text\\"') or
salary_section.find_all('div', class_=lambda c: c and 'section-text' in c))
descriptions = []
for text_div in section_texts:
paragraph = text_div.find('p')
if paragraph:
descriptions.append(paragraph.get_text(strip=True))
else:
text = text_div.get_text(strip=True)
if text and len(text) > 10:
descriptions.append(text)
salary_data['descriptions'] = descriptions
# Extract salary ranges with percentages and percentiles
salary_ranges = []
chart_labels = (salary_section.find('div', class_='vertical-bar-label') or
salary_section.find('div', class_='"vertical-bar-label"') or
salary_section.find('div', class_='\\"vertical-bar-label\\"') or
salary_section.find('div', class_=lambda c: c and 'vertical-bar-label' in c))
amounts = []
if chart_labels:
ranges = chart_labels.find_all('span')
amounts = [span.get_text(strip=True) for span in ranges if span.get_text(strip=True)]
# Extract percentile data using multiple regex patterns
percentages_and_percentiles = []
patterns = [
r'class="ghvb_percent_label">([^<]+)</span>.*?data-testid="dt-tooltip-content"[^>]*>([^<]+)</span>',
r"class='ghvb_percent_label'>([^<]+)</span>.*?data-testid='dt-tooltip-content'[^>]*>([^<]+)</span>",
r'class=\\"ghvb_percent_label\\">([^<]+)</span>.*?data-testid=\\"dt-tooltip-content\\"[^>]*>([^<]+)</span>',
r"class=\\'ghvb_percent_label\\'>([^<]+)</span>.*?data-testid=\\'dt-tooltip-content\\'[^>]*>([^<]+)</span>",
r'ghvb_percent_label[^>]*>([^<]+)</span>.*?dt-tooltip-content[^>]*>([^<]+)</span>'
]
matches = []
for pattern in patterns:
matches = re.findall(pattern, html_content, re.DOTALL)
if matches:
break
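# Pair each matched percentage label with its percentile tooltip text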
for percentage, percentile in matches:
percentages_and_percentiles.append({
'percentage': percentage.strip(),
'percentile': percentile.strip()
})
# Fallback method for extracting percentile data
if not percentages_and_percentiles:
for class_variant in ['ghvb_percent', '"ghvb_percent"', '\\"ghvb_percent\\"']:
percentile_bars = soup.find_all('div', class_=class_variant)
if percentile_bars:
break
for bar in percentile_bars:
percent_label = None
for label_class in ['ghvb_percent_label', '"ghvb_percent_label"', '\\"ghvb_percent_label\\"']:
percent_label = bar.find('span', class_=label_class)
if percent_label:
break
parent_tooltip = bar.find_parent('comparably-companies-ui-tooltip')
tooltip = None
if parent_tooltip:
tooltip = parent_tooltip.find('span', {'data-testid': 'dt-tooltip-content'})
if percent_label and tooltip:
percentages_and_percentiles.append({
'percentage': percent_label.get_text(strip=True),
'percentile': tooltip.get_text(strip=True)
})
# Combine amounts with percentile data
for i, amount in enumerate(amounts):
range_data = {'amount': amount}
if i < len(percentages_and_percentiles):
range_data.update(percentages_and_percentiles[i])
salary_ranges.append(range_data)
salary_data['salary_ranges'] = salary_ranges
# Extract departments
chips_container = (salary_section.find('ul', class_='chip-container') or
salary_section.find('ul', class_='"chip-container"') or
salary_section.find('ul', class_='\\"chip-container\\"') or
salary_section.find('ul', class_=lambda c: c and 'chip-container' in c))
if chips_container:
chip_elements = (chips_container.find_all('span', class_='chip') or
chips_container.find_all('span', class_='"chip"') or
chips_container.find_all('span', class_='\\"chip\\"') or
chips_container.find_all('span', class_=lambda c: c and 'chip' in c))
salary_data['departments'] = [chip.get_text(strip=True) for chip in chip_elements]
return salary_data
# Usage example
if __name__ == "__main__":
API_KEY = "your-piloterr-api-key"
scraper = ComparablyScraper(API_KEY)
# Scrape Airbus salary data
airbus_data = scraper.scrape_company_salaries("airbus")
if airbus_data:
print(json.dumps(airbus_data, indent=2))
else:
print("Failed to scrape data")
Example Output
When you run this script, you'll get structured data like:
{
"title": "Airbus Salaries",
"company": "Airbus",
"subtitle": "How much does Airbus pay?",
"average_compensation": "$58,437.11",
"median_compensation": "$47,538.52",
"descriptions": [
"How much do people at Airbus get paid? See the latest salaries by department and job title. The average estimated annual salary, including base and bonus, at Airbus is $58,437.11, or $28 per hour, while the estimated median salary is $47,538.52, or $22 per hour.",
"At Airbus, the highest paid job is a Director of Sales at $224,464.58 annually and the lowest is a Secretary at $300 annually. Average Airbus salaries by department include: Engineering at $52,911, Business Development at $102,777, Marketing at $34,662 and HR at $47,401. Half of Airbus salaries are above $47,538.52.",
"132 employees at Airbus rank their Compensation in the Top 15% of similar sized companies in the US (based on 394 ratings) while 124 employees at Airbus rank their Perks And Benefits in the Top 15% of similar sized companies in the US (based on 125 ratings).",
"Salaries contributed from Airbus employees include job titles like Director of HR and IT Manager. Comparably data has a total of 3 salary records from Airbus employees.",
"Last updated on 41 days ago"
],
"salary_ranges": [
{
"amount": "$15.3k",
"percentage": "33%",
"percentile": "1st Percentile"
},
{
"amount": "$31.1k",
"percentage": "49%",
"percentile": "20th Percentile"
},
{
"amount": "$47.0k",
"percentage": "48%",
"percentile": "40th Percentile"
},
{
"amount": "$139.6k",
"percentage": "100%",
"percentile": "60th Percentile"
},
{
"amount": "$224.5k",
"percentage": "10%",
"percentile": "80th Percentile"
},
{
"amount": "$325.5k",
"percentage": "0.1%",
"percentile": "100th Percentile"
}
],
"departments": [
"Admin",
"Business Development",
"Communications",
"Customer Support"
]
}
This approach gives you reliable access to Comparably's salary data, making it perfect for compensation research, market analysis, or building salary comparison tools.
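From here, the ComparablyScraper class is easy to reuse for batch jobs. A minimal sketch, assuming placeholder company slugs and an arbitrary output file:
import json
import time
# Hypothetical batch run: the slugs and the output filename are placeholders
scraper = ComparablyScraper("your-piloterr-api-key")
results = {}
for slug in ["airbus", "another-company", "yet-another-company"]:
    data = scraper.scrape_company_salaries(slug)
    if data:
        results[slug] = data
    time.sleep(2)  # brief pause between requests

with open("comparably_salaries.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
print(f"Scraped {len(results)} companies")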