the portable Python dataframe library
ezpz PySpark dev environment with Docker
Python library designed to enhance the developer experience when working with AWS Glue ETL and Python Shell jobs. It reduces boilerplate code, increases type safety, and improves IDE auto-completion, making Glue development easier and more efficient.
Open Targets Python framework for post-GWAS analysis
PySpark methods to enhance developer productivity 📣 👯 🎉
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
This capstone project includes an end-to-end data engineering pipeline, from ingesting data from an HTTPS server, to cleaning and transforming it in Azure Databricks, to finally reporting on it in Power BI Desktop
A searchable collection of useful little pieces of code
Simple and Distributed Machine Learning
Data engineering examples covering Airflow and Mage for workflows; dbt for BigQuery, Redshift, and ClickHouse; and Spark and Kafka for batch/streaming processing
Possibly the fastest DataFrame-agnostic quality check library in town.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can be used to generate large simulated/synthetic datasets for testing, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines
This project aims to enhance the accuracy and efficiency of stock market predictions through machine learning. It leverages PySpark, a robust framework for distributed data processing, to handle large datasets and perform complex computations.
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
State of the Art Natural Language Processing
This project demonstrates how to use PySpark to predict customer churn. The dataset contains various attributes of a telecom company's customers; the objective is to build a machine learning model that predicts whether a customer will churn.
Play around with Databricks and PySpark