Snake: A large scale SQL Query Dataset focussing on DDL and DCL commands

This repository contains a comprehensive dataset of SQL queries and their corresponding SQL statements, suitable for various applications in database management, query optimization, and machine learning.

Files

snake_dataset.tar.bz2: Compressed dataset file containing 1.1 million datapoints as JSON formatted tokenized SQL queries and statements.
decompress.py: Python script to decompress snake_dataset.tar.bz2 and extract snake_dataset.json.
sample_dataset.py: Python script to get a sample of 100000 datapoints from the snake_dataset.json file after extraction.
sample_dataset.tar.bz2: A sample from the main dataset (Can be extracted using decompress.py).
requirements.txt: File consisting of required libraries to sucessfully run the decompress.py and sample_dataset.py files.
README.md : README file for the dataset

Dataset Overview

The dataset comprises over 1.1 million entries, each including a textual SQL query description and its corresponding SQL statement. The queries cover a wide range of operations including table creation, data insertion, schema manipulation, and transaction management. These queries are corresponding to the following databases:

HR: ['Employees', 'Projects', 'Departments'],
Education: ['Courses', 'Students'],
Library: ['Books'],
eCommerce: ['Orders', 'Products', 'Suppliers', 'Customers'],
Finance: ['Invoices', 'Payments', 'Expenses', 'Budgets', 'Assets', 'Liabilities'],
Logistics: ['Shipments', 'Categories'],
Sales: ['Sales', 'Reviews', 'Campaigns', 'Promotions', 'Coupons'],
IT: ['Tasks', 'Assignments', 'Resources'],
Support: ['Feedback', 'Complaints'],
Events: ['Events', 'Locations', 'Schedules'],
Transport: ['Tickets', 'Flights'],
Hospitality: ['Hotels', 'Reservations'],
Membership: ['Memberships', 'Subscriptions'],
Legal: ['Contracts', 'Leases', 'Policies', 'Claims'],
Messaging: ['Messages', 'Notifications'],
Logs: ['Logs'],
Reports: ['Reports'],
Alerts: ['Alerts'],
Requests: ['Requests', 'Issues'],
Documents: ['Documents', 'Notes'],
Calendar: ['Calendars', 'Agendas'],
Widgets: ['Widgets'],
Profiles: ['Profiles'],
Jobs: ['Jobs'],
Social: ['Posts', 'Comments', 'Likes', 'Followers', 'Tags'],
Books: ['Authors', 'Genres'],
Monitoring: ['Audits'],
Actions: ['Actions', 'Errors', 'Warnings'],
Default: [] # For any table not covered above

Data Content And Format

{
        "query": "Show all grants for the user 'johnowens'.",
        "query_toks": [
            "SHOW",
            "GRANTS",
            "FOR",
            "'johnowens",
            "'",
            ";"
        ],
        "sql": "SHOW GRANTS FOR 'johnowens';",
        "question_toks": [
            "Show",
            "all",
            "grants",
            "for",
            "the",
            "user",
            "'johnowens",
            "'",
            "."
        ],
        "db_id": "Default",
        "qid": 92003752
    },
    {
        "query": "Create an index named hospital on column DownloadCount in table Reviews.",
        "query_toks": [
            "CREATE",
            "INDEX",
            "hospital",
            "ON",
            "Reviews",
            "(",
            "DownloadCount",
            ")",
            ";"
        ],
        "sql": "CREATE INDEX hospital ON Reviews(DownloadCount);",
        "question_toks": [
            "Create",
            "an",
            "index",
            "named",
            "hospital",
            "on",
            "column",
            "DownloadCount",
            "in",
            "table",
            "Reviews",
            "."
        ],
        "db_id": "Sales",
        "qid": 62944826
    }

Usage

Decompressing `snake_dataset.tar.bz2`

To decompress snake_dataset.tar.bz2 and extract snake_dataset.json, use the provided decompress.py script:

python decompress.py

Sampling from the dataset

To sample a part from the dataset and use the sample to train model, use the provided sample_dataset.py script:

python sample_dataset.py

Contributions

Contributions to improve the dataset or add new features are welcome! To contribute, please fork this repository, make your changes, and submit a pull request.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md
decomress.py		decomress.py
requirements.txt		requirements.txt
sample_dataset.py		sample_dataset.py
sample_dataset.tar.bz2		sample_dataset.tar.bz2
snake_dataset.tar.bz2		snake_dataset.tar.bz2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Snake: A large scale SQL Query Dataset focussing on DDL and DCL commands

Files

Dataset Overview

Data Content And Format

Usage

Decompressing `snake_dataset.tar.bz2`

Sampling from the dataset

Contributions

About

Releases

Packages

Languages

License

Aditya-Codes-247/snake

Folders and files

Latest commit

History

Repository files navigation

Snake: A large scale SQL Query Dataset focussing on DDL and DCL commands

Files

Dataset Overview

Data Content And Format

Usage

Decompressing snake_dataset.tar.bz2

Sampling from the dataset

Contributions

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Decompressing `snake_dataset.tar.bz2`

Packages