Skip to content

This repository hosts a comprehensive dataset of SQL queries paired with their natural language descriptions. The dataset includes over 1.1 million entries covering a wide range of SQL operations and schema manipulations. Additionally, it provides utilities for dataset decompression and management.

License

Notifications You must be signed in to change notification settings

Aditya-Codes-247/snake

Repository files navigation

Snake: A large scale SQL Query Dataset focussing on DDL and DCL commands

This repository contains a comprehensive dataset of SQL queries and their corresponding SQL statements, suitable for various applications in database management, query optimization, and machine learning.

Files

  • snake_dataset.tar.bz2: Compressed dataset file containing 1.1 million datapoints as JSON formatted tokenized SQL queries and statements.
  • decompress.py: Python script to decompress snake_dataset.tar.bz2 and extract snake_dataset.json.
  • sample_dataset.py: Python script to get a sample of 100000 datapoints from the snake_dataset.json file after extraction.
  • sample_dataset.tar.bz2: A sample from the main dataset (Can be extracted using decompress.py).
  • requirements.txt: File consisting of required libraries to sucessfully run the decompress.py and sample_dataset.py files.
  • README.md : README file for the dataset

Dataset Overview

The dataset comprises over 1.1 million entries, each including a textual SQL query description and its corresponding SQL statement. The queries cover a wide range of operations including table creation, data insertion, schema manipulation, and transaction management. These queries are corresponding to the following databases:

  • HR: ['Employees', 'Projects', 'Departments'],
  • Education: ['Courses', 'Students'],
  • Library: ['Books'],
  • eCommerce: ['Orders', 'Products', 'Suppliers', 'Customers'],
  • Finance: ['Invoices', 'Payments', 'Expenses', 'Budgets', 'Assets', 'Liabilities'],
  • Logistics: ['Shipments', 'Categories'],
  • Sales: ['Sales', 'Reviews', 'Campaigns', 'Promotions', 'Coupons'],
  • IT: ['Tasks', 'Assignments', 'Resources'],
  • Support: ['Feedback', 'Complaints'],
  • Events: ['Events', 'Locations', 'Schedules'],
  • Transport: ['Tickets', 'Flights'],
  • Hospitality: ['Hotels', 'Reservations'],
  • Membership: ['Memberships', 'Subscriptions'],
  • Legal: ['Contracts', 'Leases', 'Policies', 'Claims'],
  • Messaging: ['Messages', 'Notifications'],
  • Logs: ['Logs'],
  • Reports: ['Reports'],
  • Alerts: ['Alerts'],
  • Requests: ['Requests', 'Issues'],
  • Documents: ['Documents', 'Notes'],
  • Calendar: ['Calendars', 'Agendas'],
  • Widgets: ['Widgets'],
  • Profiles: ['Profiles'],
  • Jobs: ['Jobs'],
  • Social: ['Posts', 'Comments', 'Likes', 'Followers', 'Tags'],
  • Books: ['Authors', 'Genres'],
  • Monitoring: ['Audits'],
  • Actions: ['Actions', 'Errors', 'Warnings'],
  • Default: [] # For any table not covered above

Data Content And Format

{
        "query": "Show all grants for the user 'johnowens'.",
        "query_toks": [
            "SHOW",
            "GRANTS",
            "FOR",
            "'johnowens",
            "'",
            ";"
        ],
        "sql": "SHOW GRANTS FOR 'johnowens';",
        "question_toks": [
            "Show",
            "all",
            "grants",
            "for",
            "the",
            "user",
            "'johnowens",
            "'",
            "."
        ],
        "db_id": "Default",
        "qid": 92003752
    },
    {
        "query": "Create an index named hospital on column DownloadCount in table Reviews.",
        "query_toks": [
            "CREATE",
            "INDEX",
            "hospital",
            "ON",
            "Reviews",
            "(",
            "DownloadCount",
            ")",
            ";"
        ],
        "sql": "CREATE INDEX hospital ON Reviews(DownloadCount);",
        "question_toks": [
            "Create",
            "an",
            "index",
            "named",
            "hospital",
            "on",
            "column",
            "DownloadCount",
            "in",
            "table",
            "Reviews",
            "."
        ],
        "db_id": "Sales",
        "qid": 62944826
    }

Usage

Decompressing snake_dataset.tar.bz2

To decompress snake_dataset.tar.bz2 and extract snake_dataset.json, use the provided decompress.py script:

python decompress.py

Sampling from the dataset

To sample a part from the dataset and use the sample to train model, use the provided sample_dataset.py script:

python sample_dataset.py

Contributions

Contributions to improve the dataset or add new features are welcome! To contribute, please fork this repository, make your changes, and submit a pull request.

About

This repository hosts a comprehensive dataset of SQL queries paired with their natural language descriptions. The dataset includes over 1.1 million entries covering a wide range of SQL operations and schema manipulations. Additionally, it provides utilities for dataset decompression and management.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages