How to Scrape Company Salary Data with Python


Scraping salary data helps businesses stay competitive by understanding compensation benchmarks across industries. It also empowers job seekers to make informed career decisions based on real-world salary insights. Comparably is a valuable resource for salary transparency, providing insights into compensation across different companies and departments. However, extracting this data programmatically can be challenging due to dynamic content loading and anti-bot measures.

In this tutorial, we'll show you how to scrape Comparably's salary data using Python and Piloterr's powerful Website Rendering API.

Why Use Piloterr for Comparably Scraping?

Comparably uses Angular and loads content dynamically, making traditional scraping methods ineffective. Piloterr's browser rendering API solves this by:

  • Rendering JavaScript: fully executes the Angular application
  • Bypassing protection: handles Cloudflare and other anti-bot measures
  • Browser instructions: allows scrolling to trigger lazy-loaded content
  • Wait conditions: ensures content is fully loaded before extraction

Prerequisites

Before starting, you'll need:

pip install requests beautifulsoup4 lxml

You'll also need a Piloterr API key: sign up at Piloterr.com to get one.
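Rather than hard-coding the key into your scripts, you can read it from an environment variable. Here's a minimal sketch; the `PILOTERR_API_KEY` variable name is our own choice, not a Piloterr convention:

```python
import os

def get_api_key() -> str:
    """Read the Piloterr API key from the environment, failing fast if it's missing."""
    key = os.environ.get("PILOTERR_API_KEY")
    if not key:
        raise RuntimeError("Set the PILOTERR_API_KEY environment variable first")
    return key
```

This keeps the key out of version control and lets you switch keys per environment without editing code.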

Step 1: Fetch the rendered HTML

First, let's use Piloterr to get the fully rendered HTML of a company page:

import requests
import json
import re
from bs4 import BeautifulSoup

def fetch_comparably_data(company_name, api_key):
    """
    Fetch rendered HTML from Comparably company page using Piloterr
    """
    url = "https://piloterr.com/api/v2/website/rendering"
    
    headers = {
        "x-api-key": api_key,
        "Content-Type": "application/json"
    }
    
    # Browser instructions to ensure salary section loads
    browser_instructions = [
        {
            "type": "scroll_to_bottom",
            "duration": 15,
            "wait_time_s": 2
        }
    ]
    
    payload = {
        "query": f"https://www.comparably.com/companies/{company_name}",
        "wait_for": "#ng-state",  # Wait for Angular to initialize
        "browser_instructions": browser_instructions
    }
    
    response = requests.post(url, headers=headers, json=payload)
    
    if response.status_code == 200:
        return response.text
    else:
        raise Exception(f"Failed to fetch data: {response.status_code} - {response.text}")

This tells the browser to scroll to the bottom of the page over a 15-second duration, with a 2-second pause afterward. It ensures that all dynamic content, especially sections loaded on scroll such as Comparably's salary data, is fully rendered before the page is captured.

Step 2: Complete scraping script

Here's the complete script that ties everything together:

import requests
import json
import re
from bs4 import BeautifulSoup

class ComparablyScraper:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://piloterr.com/api/v2/website/rendering"
    
    def scrape_company_salaries(self, company_name):
        """
        Complete scraping workflow for Comparably salary data
        """
        try:
            # Step 1: Fetch rendered HTML
            print(f"Fetching data for {company_name}...")
            html_content = self.fetch_comparably_data(company_name)
            
            # Step 2: Parse salary data
            print("Parsing salary information...")
            salary_data = self.parse_salary_section(html_content)
            
            if salary_data:
                return salary_data
            else:
                print("No salary section found")
                return None
                
        except Exception as e:
            print(f"Error scraping {company_name}: {str(e)}")
            return None
    
    def fetch_comparably_data(self, company_name):
        headers = {
            "x-api-key": self.api_key,
            "Content-Type": "application/json"
        }
        
        browser_instructions = [
            {
                "type": "scroll_to_bottom",
                "duration": 15,
                "wait_time_s": 2
            }
        ]
        
        payload = {
            "query": f"https://www.comparably.com/companies/{company_name}",
            "wait_for": "#ng-state",
            "browser_instructions": browser_instructions
        }
        
        response = requests.post(self.base_url, headers=headers, json=payload)
        
        if response.status_code == 200:
            return response.text
        else:
            raise Exception(f"API request failed: {response.status_code} - {response.text}")
    
    def parse_salary_section(self, html_content):
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Find salary section with multiple fallback strategies
        salary_section = soup.find('section', id='salaries')
        if not salary_section:
            salary_section = soup.find('section', id='"salaries"')
        if not salary_section:
            salary_section = soup.select_one('section[id*="salaries"]')
        if not salary_section:
            salary_section = soup.find('section', class_=lambda x: x and 'company-salaries' in ' '.join(x) if x else False)
        
        if not salary_section:
            return None
        
        # Check if there's an inner section with the actual content
        inner_section = (salary_section.find('section', class_=lambda x: x and any('section' in str(cls) for cls in x) if x else False) or
                        salary_section.find('section', class_=lambda x: x and any('company-salaries' in str(cls) for cls in x) if x else False))
        if inner_section:
            salary_section = inner_section
        
        salary_data = {}
        
        # Extract title and company name with multiple CSS class variants
        title_element = (salary_section.find('h2', class_='section-title') or 
                        salary_section.find('h2', class_='"section-title"') or
                        salary_section.find('h2', class_='\\"section-title\\"') or
                        salary_section.find('h2', class_=lambda x: x and any('section-title' in str(cls) for cls in x) if x else False))
        if title_element:
            title_text = title_element.get_text(strip=True)
            salary_data['title'] = title_text
            salary_data['company'] = title_text.replace(' Salaries', '').strip()
        
        # Extract subtitle
        subtitle_element = (salary_section.find('h3', class_='section-subtitle') or
                           salary_section.find('h3', class_='"section-subtitle"') or
                           salary_section.find('h3', class_='\\"section-subtitle\\"') or
                           salary_section.find('h3', class_=lambda x: x and any('section-subtitle' in str(cls) for cls in x) if x else False))
        if subtitle_element:
            salary_data['subtitle'] = subtitle_element.get_text(strip=True)
        
        # Extract compensation data with improved regex matching
        avg_compensation = (salary_section.find('div', class_='average-compensation') or
                           salary_section.find('div', class_='"average-compensation"') or
                           salary_section.find('div', class_='\\"average-compensation\\"') or
                           salary_section.find('div', class_=lambda x: x and any('average-compensation' in str(cls) for cls in x) if x else False))
        if avg_compensation:
            avg_text = avg_compensation.get_text(strip=True)
            amount_match = re.search(r'\$[\d,]+\.?\d*\*?', avg_text)
            salary_data['average_compensation'] = amount_match.group().replace('*', '') if amount_match else avg_text.split()[0]
        
        median_compensation = (salary_section.find('div', class_='median-compensation') or
                              salary_section.find('div', class_='"median-compensation"') or
                              salary_section.find('div', class_='\\"median-compensation\\"') or
                              salary_section.find('div', class_=lambda x: x and any('median-compensation' in str(cls) for cls in x) if x else False))
        if median_compensation:
            median_text = median_compensation.get_text(strip=True)
            amount_match = re.search(r'\$[\d,]+\.?\d*\*?', median_text)
            salary_data['median_compensation'] = amount_match.group().replace('*', '') if amount_match else median_text.split()[0]
        
        # Extract descriptions from section-text divs
        section_texts = (salary_section.find_all('div', class_='section-text') or
                        salary_section.find_all('div', class_='"section-text"') or
                        salary_section.find_all('div', class_='\\"section-text\\"') or
                        salary_section.find_all('div', class_=lambda x: x and any('section-text' in str(cls) for cls in x) if x else False))
        descriptions = []
        for text_div in section_texts:
            paragraph = text_div.find('p')
            if paragraph:
                descriptions.append(paragraph.get_text(strip=True))
            else:
                text = text_div.get_text(strip=True)
                if text and len(text) > 10:
                    descriptions.append(text)
        salary_data['descriptions'] = descriptions
        
        # Extract salary ranges with percentages and percentiles
        salary_ranges = []
        
        chart_labels = (salary_section.find('div', class_='vertical-bar-label') or
                       salary_section.find('div', class_='"vertical-bar-label"') or
                       salary_section.find('div', class_='\\"vertical-bar-label\\"') or
                       salary_section.find('div', class_=lambda x: x and any('vertical-bar-label' in str(cls) for cls in x) if x else False))
        amounts = []
        if chart_labels:
            ranges = chart_labels.find_all('span')
            amounts = [span.get_text(strip=True) for span in ranges if span.get_text(strip=True)]
        
        # Extract percentile data using multiple regex patterns
        percentages_and_percentiles = []
        
        patterns = [
            r'class="ghvb_percent_label">([^<]+)</span>.*?data-testid="dt-tooltip-content"[^>]*>([^<]+)</span>',
            r"class='ghvb_percent_label'>([^<]+)</span>.*?data-testid='dt-tooltip-content'[^>]*>([^<]+)</span>",
            r'class=\\"ghvb_percent_label\\">([^<]+)</span>.*?data-testid=\\"dt-tooltip-content\\"[^>]*>([^<]+)</span>',
            r"class=\\'ghvb_percent_label\\'>([^<]+)</span>.*?data-testid=\\'dt-tooltip-content\\'[^>]*>([^<]+)</span>",
            r'ghvb_percent_label[^>]*>([^<]+)</span>.*?dt-tooltip-content[^>]*>([^<]+)</span>'
        ]
        
        matches = []
        for pattern in patterns:
            matches = re.findall(pattern, html_content, re.DOTALL)
            if matches:
                break
        
        for percentage, percentile in matches:
            percentages_and_percentiles.append({
                'percentage': percentage.strip(),
                'percentile': percentile.strip()
            })
        
        # Fallback method for extracting percentile data
        if not percentages_and_percentiles:
            for class_variant in ['ghvb_percent', '"ghvb_percent"', '\\"ghvb_percent\\"']:
                percentile_bars = soup.find_all('div', class_=class_variant)
                if percentile_bars:
                    break
            
            for bar in percentile_bars:
                percent_label = None
                for label_class in ['ghvb_percent_label', '"ghvb_percent_label"', '\\"ghvb_percent_label\\"']:
                    percent_label = bar.find('span', class_=label_class)
                    if percent_label:
                        break
                
                parent_tooltip = bar.find_parent('comparably-companies-ui-tooltip')
                tooltip = None
                if parent_tooltip:
                    tooltip = parent_tooltip.find('span', {'data-testid': 'dt-tooltip-content'})
                
                if percent_label and tooltip:
                    percentages_and_percentiles.append({
                        'percentage': percent_label.get_text(strip=True),
                        'percentile': tooltip.get_text(strip=True)
                    })
        
        # Combine amounts with percentile data
        for i, amount in enumerate(amounts):
            range_data = {'amount': amount}
            if i < len(percentages_and_percentiles):
                range_data.update(percentages_and_percentiles[i])
            salary_ranges.append(range_data)
        
        salary_data['salary_ranges'] = salary_ranges
        
        # Extract departments
        chips_container = (salary_section.find('ul', class_='chip-container') or
                          salary_section.find('ul', class_='"chip-container"') or
                          salary_section.find('ul', class_='\\"chip-container\\"') or
                          salary_section.find('ul', class_=lambda x: x and any('chip-container' in str(cls) for cls in x) if x else False))
        if chips_container:
            chip_elements = (chips_container.find_all('span', class_='chip') or
                           chips_container.find_all('span', class_='"chip"') or
                           chips_container.find_all('span', class_='\\"chip\\"') or
                           chips_container.find_all('span', class_=lambda x: x and any('chip' in str(cls) for cls in x) if x else False))
            salary_data['departments'] = [chip.get_text(strip=True) for chip in chip_elements]
        
        return salary_data

# Usage example
if __name__ == "__main__":
    API_KEY = "your-piloterr-api-key"
    scraper = ComparablyScraper(API_KEY)
    
    # Scrape Airbus salary data
    airbus_data = scraper.scrape_company_salaries("airbus")
    
    if airbus_data:
        print(json.dumps(airbus_data, indent=2))
    else:
        print("Failed to scrape data")
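If you need data for several companies, a sequential loop with a pause between requests keeps the load polite. A sketch building on the scraper class above (the `scrape_many` helper and the 2-second default pause are our own additions, not part of Piloterr):

```python
import time

def scrape_many(scraper, company_slugs, delay_s=2.0):
    """Scrape a list of company slugs sequentially, pausing between requests."""
    results = {}
    for slug in company_slugs:
        data = scraper.scrape_company_salaries(slug)
        if data is not None:
            results[slug] = data
        time.sleep(delay_s)
    return results
```

Companies that fail to scrape are simply skipped, so the returned dict contains only successful results.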

Example Output

When you run this script, you'll get structured data like:

{
  "title": "Airbus Salaries",
  "company": "Airbus",
  "subtitle": "How much does Airbus pay?",
  "average_compensation": "$58,437.11",
  "median_compensation": "$47,538.52",
  "descriptions": [
    "How much do people at Airbus get paid? See the latest salaries by department and job title. The average estimated annual salary, including base and bonus, at Airbus is $58,437.11, or $28 per hour, while the estimated median salary is $47,538.52, or $22 per hour.",
    "At Airbus, the highest paid job is a Director of Sales at $224,464.58 annually and the lowest is a Secretary at $300 annually. Average Airbus salaries by department include: Engineering at $52,911, Business Development at $102,777, Marketing at $34,662 and HR at $47,401. Half of Airbus salaries are above $47,538.52.",
    "132 employees at Airbus rank their Compensation in the Top 15% of similar sized companies in the US (based on 394 ratings) while 124 employees at Airbus rank their Perks And Benefits in the Top 15% of similar sized companies in the US (based on 125 ratings).",
    "Salaries contributed from Airbus employees include job titles like Director of HR and IT Manager. Comparably data has a total of 3 salary records from Airbus employees.",
    "Last updated on 41 days ago"
  ],
  "salary_ranges": [
    {
      "amount": "$15.3k",
      "percentage": "33%",
      "percentile": "1st Percentile"
    },
    {
      "amount": "$31.1k",
      "percentage": "49%",
      "percentile": "20th Percentile"
    },
    {
      "amount": "$47.0k",
      "percentage": "48%",
      "percentile": "40th Percentile"
    },
    {
      "amount": "$139.6k",
      "percentage": "100%",
      "percentile": "60th Percentile"
    },
    {
      "amount": "$224.5k",
      "percentage": "10%",
      "percentile": "80th Percentile"
    },
    {
      "amount": "$325.5k",
      "percentage": "0.1%",
      "percentile": "100th Percentile"
    }
  ],
  "departments": [
    "Admin",
    "Business Development",
    "Communications",
    "Customer Support"
  ]
}

This approach gives you reliable access to Comparably's salary data, making it perfect for compensation research, market analysis, or building salary comparison tools.
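Note that the amounts in `salary_ranges` come back as display strings such as "$15.3k", while the compensation fields use the full "$58,437.11" form. If you want to analyze these numerically, a small normalization helper (our own addition, not part of the scraper above) can handle both formats:

```python
import re

def parse_amount(text):
    """Convert a display amount like '$15.3k' or '$58,437.11' to a float."""
    match = re.search(r'\$([\d,]+\.?\d*)\s*([kKmM]?)', text)
    if not match:
        raise ValueError(f"Unrecognized amount: {text!r}")
    value = float(match.group(1).replace(',', ''))
    suffix = match.group(2).lower()
    if suffix == 'k':
        value *= 1_000      # thousands, e.g. '$15.3k'
    elif suffix == 'm':
        value *= 1_000_000  # millions, just in case
    return value
```

For example, `parse_amount("$15.3k")` yields `15300.0` and `parse_amount("$58,437.11")` yields `58437.11`.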