Whether you are a beginner in data science or an expert who has worked on the field's trending projects, clean and authentic datasets are crucial to a successful outcome. With that in mind, we have sourced datasets from different domains to help you test your models and algorithms and build your skills.
The following datasets cover retail, healthcare, agricultural statistics, foreign investments, finance, and startup funding. Budding data scientists and data science enthusiasts can use them to practise and hone their skills. Each dataset comes with a content description and attribute information, which makes it easier to fit into any analytical structure.
The path to becoming an expert in data science is long and laborious. While understanding the latest trends is important for staying at the top of your game, developing your own style is equally crucial for lasting in the field. Use the following datasets to create projects and gain experience that you can showcase in your CV.
Datasets for Creating Projects of Data Science
Sr. No. | Domain | Dataset link | Description |
1. | Retail Analytics | Online Retail | Abstract: This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. Attribute Information: InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation. StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. Description: Product (item) name. Nominal. Quantity: The quantities of each product (item) per transaction. Numeric. InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated. UnitPrice: Unit price. Numeric, product price per unit in sterling. CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. Country: Country name. Nominal, the name of the country where each customer resides. |
2. | Healthcare Analytics | Heart Diseases | This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer-valued from 0 (no presence) to 4. Attribute Information: 1. age; 2. sex; 3. chest pain type (4 values); 4. resting blood pressure; 5. serum cholesterol in mg/dl; 6. fasting blood sugar > 120 mg/dl; 7. resting electrocardiographic results (values 0, 1, 2); 8. maximum heart rate achieved; 9. exercise-induced angina; 10. oldpeak = ST depression induced by exercise relative to rest; 11. the slope of the peak exercise ST segment; 12. number of major vessels (0-3) colored by fluoroscopy; 13. thal: 3 = normal, 6 = fixed defect, 7 = reversible defect |
3. | Environmental Analytics | Agriculture crop Production in India | This dataset can help with problems related to the cultivation and production of various crops in India. Attribute Information: Crop: string, crop name. Variety: string, crop subsidiary name. State: string, place of crop cultivation/production. Quantity: integer, number of quintals/hectares. Production: integer, number of years of production. Season: DateTime, medium (number of days), long (number of days). Unit: string, tons. Cost: integer, cost of cultivation and production. Recommended Zone: string, place (State, Mandal, Village). |
4. | Investment Analytics | Foreign Direct Investment In India | This dataset helps in understanding foreign direct investment (FDI) in India over the 17 years from 2000-01 to 2016-17. It contains sector-wise and financial-year-wise FDI data. |
5. | Financial Analytics | Capitalization of top 500 companies in India | This data set has information on the market capitalisation of the top 500 companies in India. Attribute Information: Serial Number. Name: name of the company. Mar Cap – Crore: market capitalisation in crores. Sales Qtr – Crore: quarterly sales in crores. |
6. | Business Analytics | Indian Startup Funding | This dataset has funding information on Indian startups from January 2015 to August 2017. It includes columns with the date funded, the city the startup is based out of, the names of the funders, and the amount invested (in USD). Columns: Sr No, Date (ddmmyyyy), Startup, Vertical, SubVertical, City Location, Investors’ Name, Investment Type, Amount in USD, Remarks |
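As a starting point for the Online Retail row, the cancellation rule (an invoice code beginning with the letter "c") can be applied before any revenue figure is computed. The sketch below uses only the Python standard library; the three sample rows are made up for illustration, and the column names are assumed to match the attribute list in the table.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the column layout described in the table
# (illustrative values, not taken from the real file).
SAMPLE = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123,SAMPLE MUG,6,01/12/2010 08:26,2.55,17850,United Kingdom
C536379,85123,SAMPLE MUG,-1,01/12/2010 09:41,2.55,17850,United Kingdom
536380,22960,SAMPLE JAM SET,2,01/12/2010 09:45,4.25,12583,France
"""


def is_cancellation(invoice_no: str) -> bool:
    """An invoice code starting with the letter 'c' marks a cancellation."""
    return invoice_no.strip().lower().startswith("c")


def revenue_by_country(csv_text: str) -> dict:
    """Sum Quantity * UnitPrice per country, skipping cancelled invoices."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if is_cancellation(row["InvoiceNo"]):
            continue
        totals[row["Country"]] += int(row["Quantity"]) * float(row["UnitPrice"])
    return dict(totals)


totals = revenue_by_country(SAMPLE)
```

Whether cancelled invoices should be dropped or netted against the original sale depends on the analysis; dropping them is simply the easiest choice for a first pass.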
Some general datasets for Machine learning
Bank Marketing Data Set: The data is related to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed. The classification goal is to predict whether the client will subscribe to a term deposit.
This dataset contains four files:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).
The smallest datasets are provided to test more computationally demanding machine learning algorithms.
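If you want to draw your own reduced file in the same spirit as the 10% subsets, a seeded random sample is enough. This is only a sketch: the helper name `sample_fraction`, the seed, and the rounding rule are assumptions of this example, not the procedure the dataset authors used. Rounding to the nearest integer does reproduce the 4119 examples quoted for bank-additional.csv.

```python
import random


def sample_fraction(rows, fraction=0.10, seed=0):
    """Return a random subset holding `fraction` of `rows`, original order kept."""
    k = round(len(rows) * fraction)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in chosen]


# Stand-in for the 41188 examples of bank-additional-full.csv.
full = list(range(41188))
subset = sample_fraction(full)
```

Keeping the sampled rows in their original order preserves the date ordering of the full file, which matters if you later split by time.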
Parking Birmingham Data Set: This data was collected from car parks in Birmingham that are operated by NCP. It comes from Birmingham City Council and is published under the UK Open Government Licence.
Data Set Information:
The data is recorded on a daily basis from 8:00 to 16:30. This data set gives information about occupancy rates in these car parks from 2016/10/04 to 2016/12/19.
Attribute Information: The various attributes that are present in this dataset are:
- SystemCodeNumber: Car park ID
- Capacity: Car park capacity
- Occupancy: Car park occupancy rate
- LastUpdated: Date and Time of the measure
Source: Daniel H. Stolfi, dhstolfi ‘@’ lcc.uma.es, University of Malaga – Spain.
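A first sanity check on the parking data could look like the sketch below. It assumes that Occupancy holds a count of occupied spaces (the attribute list calls it a rate, so check the file first), and that LastUpdated uses a `YYYY-MM-DD HH:MM:SS` layout. The record shown is hypothetical.

```python
from datetime import datetime, time


def occupancy_fraction(capacity: int, occupancy: int) -> float:
    """Share of spaces in use, assuming Occupancy counts occupied spaces."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return occupancy / capacity


def in_recording_window(stamp: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> bool:
    """True if the timestamp falls between 8:00 and 16:30 inclusive."""
    t = datetime.strptime(stamp, fmt).time()
    return time(8, 0) <= t <= time(16, 30)


# Hypothetical record in the four-attribute layout listed above.
record = {
    "SystemCodeNumber": "CARPARK_A",
    "Capacity": "500",
    "Occupancy": "125",
    "LastUpdated": "2016-10-04 09:30:00",
}
share = occupancy_fraction(int(record["Capacity"]), int(record["Occupancy"]))
```

Rows that fall outside the stated recording window, or whose share exceeds 1, are worth inspecting before modelling.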
Occupancy Detection Data Set: Experimental data used for binary classification (room occupancy) from Temperature, Humidity, Light and CO2. Ground-truth occupancy was obtained from time stamped pictures that were taken every minute.
Data Set Information: Three data sets are submitted, for training and testing.
Attribute Information:
- Date and time, as year-month-day hour:minute:second
- Temperature, in Celsius
- Relative Humidity, %
- Light, in Lux
- CO2, in ppm
- Humidity Ratio, derived quantity from temperature and relative humidity, in kg water-vapor/kg-air
- Occupancy, 0 or 1, 0 for not occupied, 1 for occupied status
Source: Luis Candanedo, luismiguel.candanedoibarra ‘@’ umons.ac.be, UMONS.
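The humidity ratio in the attribute list is described as derived from temperature and relative humidity. One common way to compute such a quantity is sketched below, using a Magnus-type approximation for saturation vapor pressure and standard sea-level pressure. Both choices are assumptions of this example; the dataset's authors may have used a different formula or pressure value.

```python
import math


def saturation_vapor_pressure_hpa(temp_c: float) -> float:
    """Magnus-type approximation of saturation vapor pressure over water, in hPa."""
    return 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))


def humidity_ratio(temp_c: float, rel_humidity_pct: float,
                   pressure_hpa: float = 1013.25) -> float:
    """Mass of water vapor per mass of dry air (kg/kg).

    pressure_hpa defaults to standard sea-level pressure (an assumption).
    """
    vapor = rel_humidity_pct / 100.0 * saturation_vapor_pressure_hpa(temp_c)
    return 0.622 * vapor / (pressure_hpa - vapor)


# Illustrative indoor conditions: 23 degrees Celsius, 27 % relative humidity.
w = humidity_ratio(23.0, 27.0)
```

Because this feature is a deterministic function of two other columns, it adds no new information for most models; it is mainly useful as a consistency check on the data.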
Multi-Domain Sentiment Dataset: The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs, and Electronics. Each domain has several thousand reviews, but the exact number varies by domain. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
A few notes regarding the data, as given by the dataset's authors:
1) There are 4 directories corresponding to each of the four domains. Each directory contains 3 files called positive.review, negative.review and unlabeled.review. (The books directory doesn’t contain the unlabeled file, but the link is below.) While the positive and negative files contain positive and negative reviews, these aren’t necessarily the splits we used in the experiments. We randomly drew from the three files ignoring the file names.
2) Each file contains a pseudo XML scheme for encoding the reviews. Most of the fields are self-explanatory. The reviews have a unique ID field that isn’t very unique. If it has two unique id fields, ignore the one containing only a number.
There are always small details and I am sure that I omitted many of them. If you have a question after reading the paper and this page, please let me know.
This sentiment dataset was used for the paper: John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association of Computational Linguistics (ACL), 2007.
Note: If you use this data for your research or a publication, please cite the above paper as the reference for the data. Also, please drop me a line so I know that you found the data useful.
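The conversion of star ratings into binary labels mentioned in the dataset description can be sketched as follows. The threshold used here (more than 3 stars is positive, fewer than 3 is negative, exactly 3 is dropped) is an assumption of this example; pick whatever cut-off suits your experiment.

```python
from typing import Optional


def star_to_binary(stars: float) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (dropped)."""
    if not 1 <= stars <= 5:
        raise ValueError("rating must be between 1 and 5 stars")
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # 3-star reviews are treated as ambiguous and skipped


ratings = [5, 4, 3, 2, 1]
labels = [star_to_binary(s) for s in ratings]
```

Returning `None` for the middle rating keeps the decision to discard those reviews explicit, so they can be filtered out in one place.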
Free Spoken Digit Dataset (FSDD): FSDD is an open dataset, which means it will grow over time as data is contributed. It is a simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginnings and ends.
- 4 speakers
- 2,000 recordings (50 of each digit per speaker)
- English pronunciations
Files are named in the following format: {digitLabel}_{speakerName}_{index}.wav Example: 7_jackson_32.wav
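Since the label and speaker are encoded in the file name, they can be recovered with a small parser. The sketch below follows the `{digitLabel}_{speakerName}_{index}.wav` format given above. It assumes a single-digit label and speaker names without underscores; the `Recording` type and function name are made up for this example.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    digit: int
    speaker: str
    index: int


# One digit, a speaker name without underscores, a numeric index, then ".wav".
_PATTERN = re.compile(r"^(\d)_([^_]+)_(\d+)\.wav$")


def parse_fsdd_name(filename: str) -> Recording:
    """Split an FSDD file name into its digit label, speaker and index."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return Recording(int(match.group(1)), match.group(2), int(match.group(3)))


rec = parse_fsdd_name("7_jackson_32.wav")
```

Grouping recordings by `speaker` is a convenient way to build speaker-independent train and test splits.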
Sentiment140: This dataset contains 1.6 million (16 lakh) tweets labelled as positive or negative, with 800,000 (8 lakh) tweets in each class. It was created by students at Stanford. Their approach was unique because the training data was created automatically, as opposed to having humans manually annotate tweets. The approach assumes that any tweet with a positive emoticon, like :), is positive, and any tweet with a negative emoticon, like :(, is negative. They used the Twitter Search API to collect these tweets through a keyword search. This is described in their paper.
The dataset is described as:
- The data is a CSV with emoticons removed. Data file format has 6 fields:
- 0 – the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- 1 – the id of the tweet (2087)
- 2 – the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- 3 – the query (lyx). If there is no query, then this value is NO_QUERY.
- 4 – the user that tweeted (robotickilldozr)
- 5 – the text of the tweet (Lyx is cool)
Note: This dataset is not open-source. In case you use this dataset, please cite Sentiment140 as your source.
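The six-field layout described above maps naturally onto a small record type. The sketch below assumes that the file has no header row and that every field is quoted; both are assumptions to verify against the actual download. The sample row is assembled from the example values in the field list, with polarity code 4 chosen for illustration.

```python
import csv
import io
from typing import List, NamedTuple

# Polarity codes as given in the field description.
POLARITY = {0: "negative", 2: "neutral", 4: "positive"}


class Tweet(NamedTuple):
    polarity: str
    tweet_id: int
    date: str
    query: str
    user: str
    text: str


def parse_rows(csv_text: str) -> List[Tweet]:
    """Parse header-less CSV text with the six fields in the documented order."""
    tweets = []
    for fields in csv.reader(io.StringIO(csv_text)):
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}")
        code, tweet_id, date, query, user, text = fields
        tweets.append(
            Tweet(POLARITY[int(code)], int(tweet_id), date, query, user, text)
        )
    return tweets


SAMPLE = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
tweets = parse_rows(SAMPLE)
```

Using the `csv` module rather than splitting on commas matters here, because tweet text frequently contains commas and quotes of its own.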