This repository contains a comprehensive dataset of SQL queries and their corresponding SQL statements, suitable for various applications in database management, query optimization, and machine learning.
snake_dataset.tar.bz2
: Compressed dataset file containing 1.1 million datapoints as JSON formatted tokenized SQL queries and statements.decompress.py
: Python script to decompresssnake_dataset.tar.bz2
and extractsnake_dataset.json
.sample_dataset.py
: Python script to get a sample of 100000 datapoints from thesnake_dataset.json
file after extraction.sample_dataset.tar.bz2
: A sample from the main dataset (Can be extracted usingdecompress.py
).requirements.txt
: File consisting of required libraries to sucessfully run thedecompress.py
andsample_dataset.py
files.README.md
: README file for the dataset
The dataset comprises over 1.1 million entries, each including a textual SQL query description and its corresponding SQL statement. The queries cover a wide range of operations including table creation, data insertion, schema manipulation, and transaction management. These queries are corresponding to the following databases:
HR
: ['Employees', 'Projects', 'Departments'],Education
: ['Courses', 'Students'],Library
: ['Books'],eCommerce
: ['Orders', 'Products', 'Suppliers', 'Customers'],Finance
: ['Invoices', 'Payments', 'Expenses', 'Budgets', 'Assets', 'Liabilities'],Logistics
: ['Shipments', 'Categories'],Sales
: ['Sales', 'Reviews', 'Campaigns', 'Promotions', 'Coupons'],IT
: ['Tasks', 'Assignments', 'Resources'],Support
: ['Feedback', 'Complaints'],Events
: ['Events', 'Locations', 'Schedules'],Transport
: ['Tickets', 'Flights'],Hospitality
: ['Hotels', 'Reservations'],Membership
: ['Memberships', 'Subscriptions'],Legal
: ['Contracts', 'Leases', 'Policies', 'Claims'],Messaging
: ['Messages', 'Notifications'],Logs
: ['Logs'],Reports
: ['Reports'],Alerts
: ['Alerts'],Requests
: ['Requests', 'Issues'],Documents
: ['Documents', 'Notes'],Calendar
: ['Calendars', 'Agendas'],Widgets
: ['Widgets'],Profiles
: ['Profiles'],Jobs
: ['Jobs'],Social
: ['Posts', 'Comments', 'Likes', 'Followers', 'Tags'],Books
: ['Authors', 'Genres'],Monitoring
: ['Audits'],Actions
: ['Actions', 'Errors', 'Warnings'],Default
: [] # For any table not covered above
{
"query": "Show all grants for the user 'johnowens'.",
"query_toks": [
"SHOW",
"GRANTS",
"FOR",
"'johnowens",
"'",
";"
],
"sql": "SHOW GRANTS FOR 'johnowens';",
"question_toks": [
"Show",
"all",
"grants",
"for",
"the",
"user",
"'johnowens",
"'",
"."
],
"db_id": "Default",
"qid": 92003752
},
{
"query": "Create an index named hospital on column DownloadCount in table Reviews.",
"query_toks": [
"CREATE",
"INDEX",
"hospital",
"ON",
"Reviews",
"(",
"DownloadCount",
")",
";"
],
"sql": "CREATE INDEX hospital ON Reviews(DownloadCount);",
"question_toks": [
"Create",
"an",
"index",
"named",
"hospital",
"on",
"column",
"DownloadCount",
"in",
"table",
"Reviews",
"."
],
"db_id": "Sales",
"qid": 62944826
}
To decompress snake_dataset.tar.bz2
and extract snake_dataset.json
, use the provided decompress.py
script:
python decompress.py
To sample a part from the dataset and use the sample to train model, use the provided sample_dataset.py
script:
python sample_dataset.py
Contributions to improve the dataset or add new features are welcome! To contribute, please fork this repository, make your changes, and submit a pull request.