Advanced SQL Techniques for Data Analysts
Dive deep into advanced SQL concepts including window functions, CTEs, and query optimization.
SQL is the backbone of data analysis, but mastering its advanced features can dramatically improve your analytical capabilities. This guide covers window functions, common table expressions (CTEs), advanced analytical patterns, and query optimization. The examples use PostgreSQL syntax; most techniques carry over to other engines with minor adjustments.
Window Functions: The Game Changer
Window functions perform calculations across a set of table rows related to the current row, without collapsing the result set like GROUP BY would. They're essential for advanced analytics.
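To see the contrast concretely, compare an aggregate that collapses rows with the same aggregate used as a window function. A minimal sketch, assuming an `orders` table with `order_id`, `region`, and `order_amount` columns:

```sql
-- GROUP BY collapses the result to one row per region
SELECT region, SUM(order_amount) AS region_total
FROM orders
GROUP BY region;

-- The window version keeps every order row and attaches each
-- region's total alongside it
SELECT
    order_id,
    region,
    order_amount,
    SUM(order_amount) OVER (PARTITION BY region) AS region_total
FROM orders;
```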
Basic Window Function Syntax
```sql
SELECT
column1,
column2,
WINDOW_FUNCTION() OVER (
[PARTITION BY column] -- Divide data into groups
[ORDER BY column] -- Define order within partitions
[ROWS/RANGE frame_specification] -- Define window frame
) AS window_result
FROM table_name;
```
Ranking Functions
```sql
-- Sample sales data
WITH sales_data AS (
SELECT 'John' as salesperson, 'Q1' as quarter, 15000 as sales_amount, 'North' as region
UNION ALL SELECT 'Jane', 'Q1', 18000, 'South'
UNION ALL SELECT 'Bob', 'Q1', 12000, 'East'
UNION ALL SELECT 'Alice', 'Q1', 20000, 'West'
UNION ALL SELECT 'John', 'Q2', 16000, 'North'
UNION ALL SELECT 'Jane', 'Q2', 19000, 'South'
UNION ALL SELECT 'Bob', 'Q2', 14000, 'East'
UNION ALL SELECT 'Alice', 'Q2', 22000, 'West'
)
SELECT
salesperson,
quarter,
region,
sales_amount,
-- Rank within each quarter (gaps in ranking)
RANK() OVER (PARTITION BY quarter ORDER BY sales_amount DESC) as rank_in_quarter,
-- Dense rank (no gaps in ranking)
DENSE_RANK() OVER (PARTITION BY quarter ORDER BY sales_amount DESC) as dense_rank,
-- Row number (unique sequential number)
ROW_NUMBER() OVER (PARTITION BY quarter ORDER BY sales_amount DESC) as row_num,
-- Percentile rank (0 to 1)
PERCENT_RANK() OVER (PARTITION BY quarter ORDER BY sales_amount) as percentile_rank,
-- Ntile for quartiles
NTILE(4) OVER (PARTITION BY quarter ORDER BY sales_amount) as quartile
FROM sales_data
ORDER BY quarter, sales_amount DESC;
```
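The sample data contains no ties, so all three ranking functions agree here. Their differences only appear when values tie; a minimal sketch:

```sql
WITH t AS (
    SELECT 'A' AS name, 100 AS amount
    UNION ALL SELECT 'B', 100
    UNION ALL SELECT 'C', 90
)
SELECT
    name,
    amount,
    RANK()       OVER (ORDER BY amount DESC) AS rnk,        -- 1, 1, 3: ties share a rank, then a gap
    DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rnk,  -- 1, 1, 2: no gap
    ROW_NUMBER() OVER (ORDER BY amount DESC) AS row_num     -- 1, 2, 3: ties broken arbitrarily
FROM t;
```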
Analytical Functions
```sql
-- Advanced analytical functions
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(sales_amount) as monthly_revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
month,
monthly_revenue,
-- Running total
SUM(monthly_revenue) OVER (ORDER BY month) as running_total,
-- Moving average (3-month)
AVG(monthly_revenue) OVER (
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) as moving_avg_3m,
-- Previous month comparison
LAG(monthly_revenue, 1) OVER (ORDER BY month) as prev_month,
-- Next month (for forecasting validation)
LEAD(monthly_revenue, 1) OVER (ORDER BY month) as next_month,
-- Month-over-month growth
-- NULLIF avoids division by zero; multiplying by 100.0 forces
-- floating-point math even if revenue is stored as an integer
(monthly_revenue - LAG(monthly_revenue, 1) OVER (ORDER BY month)) * 100.0 /
NULLIF(LAG(monthly_revenue, 1) OVER (ORDER BY month), 0) as mom_growth_pct,
-- Year-over-year comparison
LAG(monthly_revenue, 12) OVER (ORDER BY month) as same_month_last_year,
-- First and last values in the dataset
FIRST_VALUE(monthly_revenue) OVER (ORDER BY month) as first_month_sales,
-- LAST_VALUE needs an explicit frame: the default frame stops at the
-- current row, which would just return each row's own value
LAST_VALUE(monthly_revenue) OVER (
ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as last_month_sales
FROM monthly_sales
ORDER BY month;
```
Common Table Expressions (CTEs)
CTEs make complex queries more readable and maintainable by breaking them into logical steps.
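Before the recursive form, here is the basic shape: a CTE names an intermediate result set that later steps can query like a table, so the query reads as a pipeline. A minimal sketch against the same `orders` table:

```sql
WITH high_value_orders AS (
    SELECT customer_id, order_amount
    FROM orders
    WHERE order_amount > 1000
)
SELECT customer_id, COUNT(*) AS big_order_count
FROM high_value_orders
GROUP BY customer_id;
```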
Recursive CTEs
```sql
-- Organizational hierarchy analysis
WITH RECURSIVE employee_hierarchy AS (
-- Base case: top-level managers
SELECT
employee_id,
name,
manager_id,
title,
1 as level,
CAST(name AS VARCHAR(1000)) as hierarchy_path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees with managers
SELECT
e.employee_id,
e.name,
e.manager_id,
e.title,
eh.level + 1,
eh.hierarchy_path || ' -> ' || e.name
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT
level,
COUNT(*) as employees_at_level,
STRING_AGG(name, ', ') as employee_names
FROM employee_hierarchy
GROUP BY level
ORDER BY level;
```
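One caution: if the data contains a cycle (an employee who is, directly or indirectly, their own manager), the recursion never terminates. A simple safeguard is to cap the depth in the recursive member, assuming real hierarchies stay below some bound; PostgreSQL 14+ also offers a built-in CYCLE clause for this. A sketch of the recursive member with a cap:

```sql
-- Recursive member with a depth cap as a cycle guard (assumes no
-- legitimate hierarchy is more than 20 levels deep)
SELECT
    e.employee_id,
    e.name,
    e.manager_id,
    e.title,
    eh.level + 1,
    eh.hierarchy_path || ' -> ' || e.name
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
WHERE eh.level < 20
```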
Complex Multi-CTE Analysis
```sql
-- Comprehensive customer analysis
WITH customer_metrics AS (
-- Calculate customer lifetime metrics
SELECT
customer_id,
COUNT(*) as total_orders,
SUM(order_amount) as total_spent,
AVG(order_amount) as avg_order_value,
MAX(order_date) as last_order_date,
MIN(order_date) as first_order_date,
COUNT(DISTINCT DATE_TRUNC('month', order_date)) as active_months
FROM orders
GROUP BY customer_id
),
customer_segments AS (
-- Segment customers based on RFM analysis
SELECT
customer_id,
total_orders,
total_spent,
avg_order_value,
active_months,
-- Recency (days since last order)
CURRENT_DATE - last_order_date as days_since_last_order,
-- Frequency score
CASE
WHEN total_orders >= 10 THEN 'High'
WHEN total_orders >= 5 THEN 'Medium'
ELSE 'Low'
END as frequency_score,
-- Monetary score
CASE
WHEN total_spent >= 1000 THEN 'High'
WHEN total_spent >= 500 THEN 'Medium'
ELSE 'Low'
END as monetary_score,
-- Customer segment
CASE
WHEN total_spent >= 1000 AND total_orders >= 5 THEN 'VIP'
WHEN total_spent >= 500 OR total_orders >= 3 THEN 'Regular'
WHEN CURRENT_DATE - last_order_date <= 90 THEN 'New'
ELSE 'At Risk'
END as customer_segment,
-- Activity status
CASE
WHEN CURRENT_DATE - last_order_date <= 30 THEN 'Active'
WHEN CURRENT_DATE - last_order_date <= 90 THEN 'At Risk'
ELSE 'Inactive'
END as activity_status
FROM customer_metrics
),
segment_analysis AS (
-- Analyze segments
SELECT
customer_segment,
activity_status,
COUNT(*) as customer_count,
AVG(total_spent) as avg_total_spent,
AVG(avg_order_value) as avg_order_value,
AVG(total_orders) as avg_total_orders,
AVG(active_months) as avg_active_months
FROM customer_segments
GROUP BY customer_segment, activity_status
)
SELECT
customer_segment,
activity_status,
customer_count,
ROUND(avg_total_spent, 2) as avg_total_spent,
ROUND(avg_order_value, 2) as avg_order_value,
ROUND(avg_total_orders, 1) as avg_total_orders,
ROUND(avg_active_months, 1) as avg_active_months,
-- Calculate segment value
ROUND(customer_count * avg_total_spent, 2) as segment_total_value,
-- Percentage of total customers
ROUND(100.0 * customer_count / SUM(customer_count) OVER (), 2) as pct_of_customers
FROM segment_analysis
ORDER BY segment_total_value DESC;
```
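The fixed thresholds above are easy to read but arbitrary. A common refinement is to score recency, frequency, and monetary value relative to the whole customer base with NTILE, so segment boundaries adapt as the data shifts. A sketch reusing the `customer_metrics` CTE from the query above:

```sql
-- Quintile-based RFM scores (5 = best in each dimension)
SELECT
    customer_id,
    NTILE(5) OVER (ORDER BY last_order_date) AS recency_score,   -- more recent = higher
    NTILE(5) OVER (ORDER BY total_orders)    AS frequency_score, -- more orders = higher
    NTILE(5) OVER (ORDER BY total_spent)     AS monetary_score   -- more spend = higher
FROM customer_metrics;
```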
Advanced Analytical Patterns
Cohort Analysis
```sql
-- Customer cohort retention analysis
WITH customer_cohorts AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date)) as cohort_month
FROM orders
GROUP BY customer_id
),
cohort_data AS (
SELECT
c.cohort_month,
DATE_TRUNC('month', o.order_date) as order_month,
COUNT(DISTINCT o.customer_id) as customers
FROM customer_cohorts c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.cohort_month, DATE_TRUNC('month', o.order_date)
),
cohort_sizes AS (
SELECT
cohort_month,
COUNT(DISTINCT customer_id) as cohort_size
FROM customer_cohorts
GROUP BY cohort_month
)
SELECT
cd.cohort_month,
cd.order_month,
-- Whole months between cohort and order month; EXTRACT(MONTH ...)
-- alone wraps back to 0 after a year, so add in the year component
(EXTRACT(YEAR FROM AGE(cd.order_month, cd.cohort_month)) * 12 +
EXTRACT(MONTH FROM AGE(cd.order_month, cd.cohort_month))) as period_number,
cd.customers,
cs.cohort_size,
ROUND(100.0 * cd.customers / cs.cohort_size, 2) as retention_rate
FROM cohort_data cd
JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
ORDER BY cd.cohort_month, cd.order_month;
```
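The output is one row per cohort and month, which is easier to scan when pivoted into a retention matrix. A sketch using aggregate FILTER clauses, assuming the query above is wrapped in a CTE named `retention_long` (columns hardcoded for the first few periods):

```sql
SELECT
    cohort_month,
    MAX(retention_rate) FILTER (WHERE period_number = 0) AS month_0,
    MAX(retention_rate) FILTER (WHERE period_number = 1) AS month_1,
    MAX(retention_rate) FILTER (WHERE period_number = 2) AS month_2,
    MAX(retention_rate) FILTER (WHERE period_number = 3) AS month_3
FROM retention_long
GROUP BY cohort_month
ORDER BY cohort_month;
```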
Time Series Analysis
```sql
-- Advanced time series analysis with seasonality
WITH daily_sales AS (
SELECT
DATE(order_date) as sale_date,
SUM(order_amount) as daily_revenue,
COUNT(*) as daily_orders
FROM orders
GROUP BY DATE(order_date)
),
sales_with_trends AS (
SELECT
sale_date,
daily_revenue,
daily_orders,
-- Day of week analysis
EXTRACT(DOW FROM sale_date) as day_of_week,
TO_CHAR(sale_date, 'Day') as day_name,
-- Seasonal patterns
EXTRACT(MONTH FROM sale_date) as month_num,
EXTRACT(QUARTER FROM sale_date) as quarter_num,
-- Moving averages
AVG(daily_revenue) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as ma_7_day,
AVG(daily_revenue) OVER (
ORDER BY sale_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as ma_30_day,
-- Volatility measures
STDDEV(daily_revenue) OVER (
ORDER BY sale_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as volatility_30_day,
-- Trend detection
daily_revenue - LAG(daily_revenue, 7) OVER (ORDER BY sale_date) as wow_change,
daily_revenue - LAG(daily_revenue, 30) OVER (ORDER BY sale_date) as mom_change
FROM daily_sales
)
SELECT
sale_date,
daily_revenue,
day_name,
ROUND(ma_7_day, 2) as seven_day_avg,
ROUND(ma_30_day, 2) as thirty_day_avg,
ROUND(volatility_30_day, 2) as volatility,
ROUND(wow_change, 2) as week_over_week_change,
ROUND(mom_change, 2) as month_over_month_change,
-- Anomaly detection (simple version)
CASE
WHEN ABS(daily_revenue - ma_30_day) > 2 * volatility_30_day
THEN 'Anomaly'
ELSE 'Normal'
END as anomaly_flag
FROM sales_with_trends
WHERE sale_date >= CURRENT_DATE - INTERVAL '90 days'
ORDER BY sale_date DESC;
```
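The day-of-week and month columns above only become informative once aggregated. A quick sketch summarizing weekly seasonality from the `daily_sales` CTE:

```sql
-- Average revenue by day of week to expose weekly seasonality
SELECT
    TO_CHAR(sale_date, 'Day') AS day_name,
    ROUND(AVG(daily_revenue)::numeric, 2) AS avg_revenue,
    ROUND(STDDEV(daily_revenue)::numeric, 2) AS revenue_stddev
FROM daily_sales
GROUP BY TO_CHAR(sale_date, 'Day'), EXTRACT(DOW FROM sale_date)
ORDER BY EXTRACT(DOW FROM sale_date);
```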
Query Optimization Techniques
Index Strategy
```sql
-- Comprehensive indexing strategy
-- Single column indexes for frequent WHERE clauses
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
CREATE INDEX idx_orders_status ON orders(status);
-- Composite indexes for common query patterns
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
CREATE INDEX idx_orders_date_amount ON orders(order_date, order_amount);
CREATE INDEX idx_products_category_price ON products(category, price);
-- Covering indexes (PostgreSQL 11+): INCLUDE stores extra columns so
-- frequent queries can be answered from the index alone
CREATE INDEX idx_orders_analysis ON orders(customer_id, order_date)
INCLUDE (order_amount, status, product_id);
-- Partial indexes for specific conditions
CREATE INDEX idx_active_customers ON customers(customer_id, last_order_date)
WHERE status = 'Active';
CREATE INDEX idx_high_value_orders ON orders(customer_id, order_date)
WHERE order_amount > 1000;
-- Functional indexes for computed columns
CREATE INDEX idx_orders_month ON orders(DATE_TRUNC('month', order_date));
CREATE INDEX idx_customers_full_name ON customers(LOWER(first_name || ' ' || last_name));
```
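Indexes only pay off if the planner actually uses them, so verify with the query plan. A sketch in PostgreSQL syntax; the filter values are illustrative:

```sql
-- EXPLAIN ANALYZE executes the query and reports the chosen plan,
-- estimated vs. actual row counts, and timing
EXPLAIN ANALYZE
SELECT customer_id, order_date, order_amount
FROM orders
WHERE customer_id = 42
  AND order_date >= '2024-01-01';
-- On a large table, look for an Index Scan using
-- idx_orders_customer_date rather than a Seq Scan
```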
Query Performance Optimization
```sql
-- Performance optimization techniques
-- EXISTS can beat IN when the subquery returns many rows, though
-- modern optimizers often plan the two identically; measure both
-- IN version
SELECT * FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders WHERE order_amount > 1000
);
-- EXISTS version
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_amount > 1000
);
-- Use LIMIT with ORDER BY efficiently
-- Add appropriate indexes for ORDER BY columns
SELECT customer_id, order_date, order_amount
FROM orders
ORDER BY order_date DESC, order_id DESC
LIMIT 100;
-- Optimize GROUP BY queries
-- Ensure GROUP BY columns are indexed
SELECT
customer_id,
COUNT(*) as order_count,
SUM(order_amount) as total_spent
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5;
-- Use window functions instead of self-joins
-- Slow: self-join approach
SELECT
o1.customer_id,
o1.order_date,
o1.order_amount,
o2.order_amount as prev_order_amount
FROM orders o1
LEFT JOIN orders o2 ON o1.customer_id = o2.customer_id
AND o2.order_date = (
SELECT MAX(order_date)
FROM orders o3
WHERE o3.customer_id = o1.customer_id
AND o3.order_date < o1.order_date
);
-- Fast: window function approach
SELECT
customer_id,
order_date,
order_amount,
LAG(order_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) as prev_order_amount
FROM orders;
```
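A related pattern: deep pagination with large OFFSET values is slow because the engine must read and discard every skipped row. Keyset (seek) pagination filters on the last row of the previous page instead, turning each page into an index seek. A sketch; the literals marking the previous page are placeholders:

```sql
-- Offset pagination: reads and throws away 100,000 rows first
SELECT customer_id, order_date, order_amount
FROM orders
ORDER BY order_date DESC, order_id DESC
LIMIT 100 OFFSET 100000;

-- Keyset pagination: seeks straight to the page using the last
-- (order_date, order_id) seen on the previous page
SELECT customer_id, order_date, order_amount
FROM orders
WHERE (order_date, order_id) < ('2024-03-15', 98765)
ORDER BY order_date DESC, order_id DESC
LIMIT 100;
```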
Advanced Data Quality and Validation
```sql
-- Comprehensive data quality checks
WITH data_quality_checks AS (
SELECT
'orders' as table_name,
'completeness' as check_type,
'customer_id' as column_name,
COUNT(*) as total_rows,
COUNT(customer_id) as passing_rows,
ROUND(100.0 * COUNT(customer_id) / COUNT(*), 2) as pass_pct
FROM orders
UNION ALL
SELECT
'orders',
'validity',
'order_amount',
COUNT(*),
COUNT(CASE WHEN order_amount > 0 THEN 1 END),
ROUND(100.0 * COUNT(CASE WHEN order_amount > 0 THEN 1 END) / COUNT(*), 2)
FROM orders
UNION ALL
SELECT
'orders',
'consistency',
'order_date',
COUNT(*),
COUNT(CASE WHEN order_date <= CURRENT_DATE THEN 1 END),
ROUND(100.0 * COUNT(CASE WHEN order_date <= CURRENT_DATE THEN 1 END) / COUNT(*), 2)
FROM orders
)
SELECT
table_name,
check_type,
column_name,
total_rows,
passing_rows,
pass_pct,
CASE
WHEN pass_pct >= 95 THEN 'PASS'
WHEN pass_pct >= 90 THEN 'WARNING'
ELSE 'FAIL'
END as quality_status
FROM data_quality_checks
ORDER BY table_name, check_type, column_name;
```
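The same pattern extends to other dimensions such as uniqueness. A sketch that reports how many rows carry a duplicated order_id (assuming order_id is meant to be unique), emitting the same passing_rows / pass_pct shape as above:

```sql
WITH counted AS (
    SELECT order_id,
           COUNT(*) OVER (PARTITION BY order_id) AS occurrences
    FROM orders
)
SELECT
    'orders' AS table_name,
    'uniqueness' AS check_type,
    'order_id' AS column_name,
    COUNT(*) AS total_rows,
    COUNT(*) FILTER (WHERE occurrences = 1) AS passing_rows,
    ROUND(100.0 * COUNT(*) FILTER (WHERE occurrences = 1) / COUNT(*), 2) AS pass_pct
FROM counted;
```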
Conclusion
Mastering these advanced SQL techniques will significantly enhance your data analysis capabilities:
Key Takeaways
1. **Window Functions** are essential for advanced analytics and time series analysis
2. **CTEs** make complex queries readable and maintainable
3. **Proper indexing** is crucial for query performance
4. **Query optimization** techniques can dramatically improve execution time
5. **Data quality checks** should be built into your analytical workflows
Best Practices
- **Start simple** and gradually add complexity
- **Always consider performance** when writing queries
- **Document your queries** with clear comments
- **Test with realistic data volumes** before deploying to production
- **Monitor query performance** and optimize as needed
- **Use version control** for your SQL scripts
Next Steps
1. Practice these techniques with your own datasets
2. Learn about query execution plans and how to read them
3. Explore database-specific features (PostgreSQL, SQL Server, etc.)
4. Study advanced topics like query parallelization and partitioning
5. Consider learning about modern analytical databases (Snowflake, BigQuery, etc.)
Remember that the best SQL technique is the one that solves your specific problem efficiently and maintainably. Always consider readability and performance together, and don't hesitate to refactor queries as requirements evolve.
With these advanced techniques in your toolkit, you'll be able to tackle complex analytical challenges and extract deeper insights from your data.