Apache Spark

Apache Spark is an open-source, general-purpose framework for distributed cluster computing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Here are 8,399 public repositories matching this topic...

metabrainz / listenbrainz-server

Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

react python music typescript database web big-data spark listenbrainz-server

Updated Jul 16, 2024
Python

spotinst / bigdata-charts

spark netapp-public owned-by-sebastien-maintrot

Updated Jul 16, 2024
Smarty

apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery real-time sql database spark hive hadoop etl snowflake olap query-engine redshift dbt elt iceberg hudi delta-lake lakehouse

Updated Jul 16, 2024
Java

NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data spark gpu rapids

Updated Jul 16, 2024
Scala

ytsaurus / ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

sql big-data spark clickhouse distributed-database lakehouse olap-database ytsaurus

Updated Jul 16, 2024
C++

getyourguide / DDataFlow

A tool to help you to test and develop pyspark code with sampled and local data

python machine-learning spark

Updated Jul 16, 2024
HTML

apache / spark

Apache Spark - A unified analytics engine for large-scale data processing

python java r scala sql big-data spark jdbc

Updated Jul 16, 2024
Scala

apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

big-data spark flink real-time-analytics data-ingestion table-store paimon streaming-datalake

Updated Jul 16, 2024
Java

projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

git java data spark aws-lambda iceberg

Updated Jul 16, 2024
Java

appuv / ToyWeatherPrediction

A Toy Weather Prediction for predicting weather condition based on location and time

java machine-learning spark

Updated Jul 16, 2024
Java

moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

data-science spark record-linkage entity-resolution fuzzy-matching deduplication em-algorithm data-matching deduplicate-data duckdb uk-gov-data-science

Updated Jul 16, 2024
Python

commoncrawl / cc-index-table

Index Common Crawl archives in tabular format

sql spark columnar-storage aws-athena apache-parquet commoncrawl

Updated Jul 16, 2024
Java

mage-ai / mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

python data-science data machine-learning sql spark pipeline etl pipelines orchestration artificial-intelligence data-engineering data-integration dbt elt transformation data-pipelines reverse-etl

Updated Jul 16, 2024
Python

Aless19 / pyspark-dev

ezpz pyspark dev environment with docker

docker spark docker-compose pyspark jupyter-lab

Updated Jul 16, 2024
Shell

collabH / bigdata-growth

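A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.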

kafka spark hive hadoop bigdata kudu hbase olap hdfs mapreduce flink debezium bigdatalearning hudi

Updated Jul 16, 2024
Shell

dashmug / glue-utils

Python library designed to enhance the developer experience when working with AWS Glue ETL and Python Shell jobs. It reduces boilerplate code, increases type safety, and improves IDE auto-completion, making Glue development easier and more efficient.

python aws spark etl pyspark data-engineering elt aws-glue

Updated Jul 16, 2024
Python

RoaringBitmap / RoaringBitmap

A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

java bitset spark roaring-bitmaps druid lucene roaringbitmap

Updated Jul 16, 2024
Java

Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

scala big-data spark sampling datasource spark-sql data-lakehouse

Updated Jul 16, 2024
Scala

AdaCore / RecordFlux

Formal specification and generation of verifiable binary parsers, message generators and protocol state machines

python parser spark communication-protocol formal-methods ada protocol-parser binary-parser formal-verification protocol-specification formal-specification

Updated Jul 16, 2024
Ada

apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

kubernetes sql spark hive hadoop jdbc thrift data-lake hacktoberfest spark-sql

Updated Jul 16, 2024
Scala

Created by Matei Zaharia

Released May 26, 2014

Followers: 420
Repository: apache/spark
Website: spark.apache.org
Wikipedia: Wikipedia

Related Topics