Thursday 30 May 2024

Common Datasets for Data Science

1. Iris Dataset

  • Description: Contains measurements of 150 iris flowers from three species.
  • Features: Sepal length, sepal width, petal length, petal width (all in cm); Species is the target label.
  • Use Case: Classification.
  • Link: UCI Machine Learning Repository
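
As a minimal sketch of how the four measurements separate from the species label, the snippet below parses a few rows in a comma-separated layout (four numeric columns followed by the species name). The rows are a small illustrative excerpt embedded inline, and `split_features_and_labels` is a hypothetical helper name, not part of any library.

```python
import csv
import io

# Illustrative rows in the layout:
# sepal length, sepal width, petal length, petal width, species
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def split_features_and_labels(text):
    """Return (features, labels) parsed from iris-style CSV text."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        features.append([float(value) for value in row[:4]])
        labels.append(row[4])
    return features, labels

X, y = split_features_and_labels(SAMPLE)
print(len(X), "rows;", sorted(set(y)))
```

Keeping the label in its own list makes it harder to feed the species to a classifier as an input by accident.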

2. Titanic Dataset

  • Description: Information about the passengers on the Titanic.
  • Features: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked; Survived is the target label.
  • Use Case: Classification (Survival prediction).
  • Link: Kaggle

3. MNIST Dataset

  • Description: Images of handwritten digits (0 to 9).
  • Features: 70,000 grayscale images of 28x28 pixels each (60,000 training, 10,000 test).
  • Use Case: Image classification.
  • Link: MNIST Database
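
Many classifiers expect each image as a flat vector, so a 28x28 image becomes 784 values, often scaled from the 0 to 255 pixel range down to 0 to 1. The sketch below shows that step on a synthetic placeholder image rather than real MNIST data; `flatten_and_scale` is a hypothetical helper name.

```python
ROWS, COLS = 28, 28  # MNIST image dimensions

def flatten_and_scale(image):
    """Flatten a 2D list of 0-255 pixel values into a 1D list scaled to [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]

# Synthetic placeholder image (all black with one white pixel), not real MNIST data.
image = [[0] * COLS for _ in range(ROWS)]
image[14][14] = 255

vector = flatten_and_scale(image)
print(len(vector))  # one value per pixel
```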

4. CIFAR-10 Dataset

  • Description: 60,000 32x32 color images in 10 classes.
  • Features: RGB pixel values; each image is labeled with one of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
  • Use Case: Image classification.
  • Link: CIFAR-10 Dataset
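
A small sketch of working with the labels, assuming they are stored as integers 0 to 9 in the same order as the class list above. `label_to_name` is a hypothetical helper, not part of any library.

```python
# Class names in the order listed above; index = integer label (assumed 0-9).
CLASS_NAMES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

# Each image is 32x32 pixels with 3 color channels.
VALUES_PER_IMAGE = 32 * 32 * 3

def label_to_name(label):
    """Translate an integer class label into its human-readable name."""
    return CLASS_NAMES[label]

print(label_to_name(3), VALUES_PER_IMAGE)
```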

5. Boston Housing Dataset

  • Description: Housing data for 506 census tracts in the Boston area.
  • Features: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT; MEDV (median home value) is the target.
  • Use Case: Regression (predicting house prices).
  • Link: UCI Machine Learning Repository

6. Wine Quality Dataset

  • Description: Chemical properties of red and white wines.
  • Features: Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol; quality (an integer score) is the target.
  • Use Case: Classification and regression (predicting the quality score).
  • Link: UCI Machine Learning Repository
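
The same quality score can serve either use case: kept as a number it is a regression target, and bucketed into categories it becomes a classification target. The sketch below illustrates both; the threshold of 7 and the scores are assumptions chosen for illustration, not values taken from the dataset, and `to_binary_label` is a hypothetical helper.

```python
def to_binary_label(quality, threshold=7):
    """Map an integer quality score to 1 ("good") or 0 ("not good").

    The default threshold of 7 is an assumption for illustration only.
    """
    return 1 if quality >= threshold else 0

# Hypothetical quality scores, not taken from the dataset.
scores = [5, 6, 7, 4, 8]

# Regression: predict the score itself.
regression_targets = [float(score) for score in scores]

# Classification: predict good / not good.
classification_targets = [to_binary_label(score) for score in scores]

print(regression_targets)
print(classification_targets)
```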

7. Adult Dataset

  • Description: Census data used for predicting whether a person earns more than $50K a year.
  • Features: Age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country; income is the target label.
  • Use Case: Classification.
  • Link: UCI Machine Learning Repository

8. Heart Disease Dataset

  • Description: Medical data used for predicting heart disease.
  • Features: Age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, target.
  • Use Case: Classification.
  • Link: UCI Machine Learning Repository

9. COCO Dataset

  • Description: Large-scale object detection, segmentation, and captioning dataset.
  • Features: Images with objects, captions, segmentations.
  • Use Case: Object detection, segmentation, image captioning.
  • Link: COCO Dataset

10. MovieLens Dataset

  • Description: Movie ratings and metadata.
  • Features: UserId, MovieId, Rating, Timestamp, MovieTitle, Genres.
  • Use Case: Recommendation systems.
  • Link: MovieLens
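
As a rough sketch of a first step toward a recommender, the snippet below computes the mean rating per movie from (UserId, MovieId, Rating, Timestamp) rows. The rows are hypothetical values made up for illustration, and `mean_rating_per_movie` is a hypothetical helper; a real recommender would go well beyond this baseline.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id, rating, timestamp) rows for illustration.
ratings = [
    (1, 10, 4.0, 964982703),
    (1, 20, 3.0, 964981247),
    (2, 10, 5.0, 964982224),
    (3, 20, 2.0, 964983815),
]

def mean_rating_per_movie(rows):
    """Return {movie_id: mean rating} computed over all rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user_id, movie_id, rating, _timestamp in rows:
        totals[movie_id] += rating
        counts[movie_id] += 1
    return {movie_id: totals[movie_id] / counts[movie_id] for movie_id in totals}

averages = mean_rating_per_movie(ratings)
print(averages)
```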

11. Amazon Reviews Dataset

  • Description: Customer reviews of products on Amazon.
  • Features: Review text, star rating, product information, reviewer’s information.
  • Use Case: Sentiment analysis, recommendation systems.
  • Link: Amazon Customer Reviews (PDS)

12. Yelp Reviews Dataset

  • Description: Reviews of businesses on Yelp.
  • Features: Review text, star rating, business information, reviewer’s information.
  • Use Case: Sentiment analysis, recommendation systems.
  • Link: Yelp Dataset

These datasets are commonly used in data science for tasks such as classification, regression, clustering, recommendation, and image processing. Most are available through Kaggle, the UCI Machine Learning Repository, or the dataset's own project page; the Link entry under each dataset names its source.
