Resources

Tools I Use Daily for Data Engineering

These are the tools and platforms I actually use in my day-to-day work. I only recommend things I’ve personally tested in production. Some links below are affiliate links — they cost you nothing extra and help support this blog.

 

Data Platform & Processing

•       Databricks — My primary workspace for everything. Notebooks, Delta tables, scheduled jobs, Unity Catalog. If you’re doing data engineering at any scale, this is the platform.

•       Apache Spark — The engine under the hood. Most of my ETL runs on Spark SQL or PySpark.

•       dbt — For SQL-first transformation workflows. Great for teams that want version-controlled, testable SQL.
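To give a flavor of the Spark SQL ETL I'm talking about: a common incremental pattern is upserting a raw landing table into a curated Delta table with MERGE. This is just a sketch — the schema and table names here are made up for illustration:

```sql
-- Upsert new and changed rows from a raw landing table into a
-- curated Delta table (table and column names are illustrative).
MERGE INTO silver.orders AS target
USING bronze.orders_raw AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Runs the same way whether you schedule it as a Databricks job or wrap it in a dbt model.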

 

SQL & Database Tools

•       DBeaver — Free universal database client. I use it for ad-hoc queries and exploring schemas across multiple connections.

•       DataGrip — JetBrains’ database IDE. Paid but worth it if you write SQL all day. Smart autocomplete saves hours.

 

Data Quality & Testing

•       Great Expectations / Soda — For automated data quality checks. I run these as part of pipeline validation before downstream tables consume the data.

•       dbt tests — Built-in unique, not_null, accepted_values, and relationships tests. Simple but effective for catching issues early.
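For anyone who hasn't seen dbt tests in practice, they're declared alongside the model in a schema.yml file. A minimal sketch (model and column names are hypothetical; recent dbt versions also accept `data_tests:` as the key):

```yaml
version: 2

models:
  - name: orders            # hypothetical model
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["placed", "shipped", "returned"]
      - name: customer_id
        tests:
          - relationships:   # every customer_id must exist in customers
              to: ref('customers')
              field: id
```

`dbt test` then compiles each of these into a SQL query that fails if any rows violate the rule.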

 

AI & Productivity

•       Claude (Anthropic) — My go-to AI for writing SQL, debugging pipeline logic, generating documentation, and drafting articles. Claude Code is particularly powerful for autonomous development tasks.

•       Cursor / GitHub Copilot — AI coding assistants for writing Python and SQL faster.

 

Pipeline Orchestration

•       Databricks Workflows — Native scheduling and orchestration. I use this for most production pipelines.

•       Apache Airflow — The industry standard for complex DAG-based orchestration. Steeper learning curve but incredibly flexible.

•       Prefect / Dagster — Modern Python-native alternatives to Airflow. Worth evaluating if you’re starting fresh.
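To make "DAG-based orchestration" concrete: at its core, an orchestrator resolves task dependencies into an execution order. Here's a toy sketch using only the Python standard library — the task names are invented, and this is deliberately not Airflow's API, just the ordering problem underneath it:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each key depends on the tasks in its value set (illustrative names).
pipeline = {
    "transform": {"extract"},
    "load_warehouse": {"transform"},
    "run_quality_checks": {"load_warehouse"},
}

# A real orchestrator (Airflow, Prefect, Dagster) layers scheduling,
# retries, and parallelism on top of exactly this ordering problem.
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)
```

Because this example is a linear chain, the order is fully determined: extract, transform, load, then checks.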

 

Monitoring & Observability

•       Monte Carlo / Metaplane — Data observability platforms that detect anomalies, schema changes, and freshness issues automatically.