Data Science Cheatsheet

A comprehensive cheat sheet covering essential concepts, tools, and techniques in Data Science. It provides a quick reference for machine learning algorithms, data manipulation, statistical methods, and more.

Fundamentals

Key Concepts

| Concept | Description |
| --- | --- |
| Supervised Learning | Learning from labeled data to predict outcomes. |
| Unsupervised Learning | Discovering patterns in unlabeled data. |
| Reinforcement Learning | Training an agent to make decisions in an environment to maximize a reward. |
| Bias-Variance Tradeoff | Balancing model complexity to minimize both bias (underfitting) and variance (overfitting). |
| Cross-Validation | Evaluating model performance on multiple subsets of the data to ensure generalization. |
| Feature Engineering | Creating new features or transforming existing ones to improve model accuracy. |
Common Algorithms
- Linear Regression: Predicts a continuous outcome using a linear equation.
- Logistic Regression: Predicts a binary outcome using a logistic function.
- Decision Trees: Partitions data into subsets based on feature values to make predictions.
- Random Forest: An ensemble of decision trees that averages predictions to improve accuracy.
- Support Vector Machines (SVM): Finds the optimal hyperplane to separate data into classes.
- K-Nearest Neighbors (KNN): Classifies data based on the majority class among its k nearest neighbors.
- K-Means Clustering: Partitions data into k clusters based on distance to cluster centroids.
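As a quick illustration of the last entry, K-Means can be run with scikit-learn. The six 2-D points below are invented purely to show two obvious groups; this is a sketch, not a recipe for real data:

```python
# A minimal K-Means clustering sketch with scikit-learn.
# The data is synthetic and purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of 2-D points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # coordinates of the two centroids
```

With well-separated data like this, the first three points land in one cluster and the last three in the other.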
Python for Data Science
Data Manipulation with Pandas
- Creating a DataFrame
- Selecting Columns
- Filtering Rows
- Grouping and Aggregation
- Handling Missing Data
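The operations above fit into one short sketch. The city/sales DataFrame is a hypothetical dataset invented for illustration:

```python
# A minimal sketch of common pandas operations on a toy dataset.
import numpy as np
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA"],
    "sales": [100, 150, 80, np.nan],
})

# Selecting columns
sales = df["sales"]

# Filtering rows
big = df[df["sales"] > 90]

# Grouping and aggregation (NaN values are skipped by default)
totals = df.groupby("city")["sales"].sum()

# Handling missing data
filled = df.fillna({"sales": 0})
print(totals)
```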
Data Visualization with Matplotlib and Seaborn
- Basic Plotting with Matplotlib
- Scatter Plot with Seaborn
- Histogram with Seaborn
- Box Plot with Seaborn
Scikit-learn for Machine Learning
- Training a Model
- Making Predictions
- Model Evaluation
- Data Preprocessing
- Train-Test Split
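The five steps above combine into one short workflow. This sketch uses scikit-learn's built-in Iris dataset and a logistic regression model, chosen only as examples:

```python
# A minimal end-to-end scikit-learn workflow: split, preprocess,
# train, predict, and evaluate on a built-in toy dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Data preprocessing: scale features to zero mean and unit variance,
# fitting the scaler on the training set only
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Training a model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Model evaluation
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
```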
Statistical Methods
Descriptive Statistics
| Statistic | Description |
| --- | --- |
| Mean | Average value of a dataset. |
| Median | Middle value of a sorted dataset. |
| Mode | Most frequent value in a dataset. |
| Standard Deviation | Measure of the spread of data around the mean. |
| Variance | Square of the standard deviation. |
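All five statistics can be computed with Python's standard `statistics` module. The sample below is an illustrative toy dataset (the `p`-prefixed functions give the population versions):

```python
# Descriptive statistics with the standard library on a toy sample.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))       # 5.0
print(statistics.median(data))     # 4.5  (average of the two middle values)
print(statistics.mode(data))       # 4    (appears three times)
print(statistics.pstdev(data))     # 2.0  (population standard deviation)
print(statistics.pvariance(data))  # 4.0  (population variance = stdev squared)
```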
Inferential Statistics
| Method | Description |
| --- | --- |
| Hypothesis Testing | A method for testing a claim or hypothesis about a population parameter. |
| P-value | Probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true. |
| Confidence Interval | Range of values likely to contain the true population parameter at a given confidence level. |
| T-test | Compares the means of two groups. |
| ANOVA | Compares the means of more than two groups. |
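As a concrete example, a two-sample t-test can be run with SciPy. The two samples below are invented, and this assumes `scipy` is installed:

```python
# A minimal hypothesis-testing sketch: two-sample t-test with SciPy
# on two small illustrative samples.
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.0, 5.9, 6.1, 5.7]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# At the conventional 0.05 significance level, a p-value below 0.05
# rejects the null hypothesis that the two group means are equal.
if p_value < 0.05:
    print("reject the null hypothesis")
```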
Model Evaluation and Tuning
Evaluation Metrics
| Metric | Description |
| --- | --- |
| Accuracy | Fraction of correctly classified instances. |
| Precision | Fraction of true positives among predicted positives. |
| Recall | Fraction of true positives among actual positives. |
| F1-Score | Harmonic mean of precision and recall. |
| AUC-ROC | Area under the Receiver Operating Characteristic curve; measures how well a classifier distinguishes between classes. |
| Mean Squared Error (MSE) | Average squared difference between predicted and actual values. |
| R-squared | Proportion of variance in the dependent variable that is predictable from the independent variables. |
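The classification metrics above map directly onto scikit-learn functions. The labels and scores below are a hypothetical example (3 true positives, 1 false positive, 1 false negative, 3 true negatives):

```python
# Classification metrics with scikit-learn on hypothetical predictions.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]    # ground-truth labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]    # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]  # predicted probabilities

print(accuracy_score(y_true, y_pred))   # 0.75 = 6 correct out of 8
print(precision_score(y_true, y_pred))  # 0.75 = TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))     # 0.75 = TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))         # 0.75 (precision equals recall here)
print(roc_auc_score(y_true, y_score))   # uses the scores, not the hard labels
```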
Hyperparameter Tuning
- Grid Search: Exhaustively searches a specified subset of a learning algorithm's hyperparameter space.
- Randomized Search: Samples a fixed number of candidates from a hyperparameter search space.
- Bayesian Optimization: Uses Bayesian inference to find hyperparameters that optimize a given metric.
- Cross-Validation: Evaluates model performance on multiple subsets of the data to ensure generalization.
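Grid search and cross-validation combine in scikit-learn's `GridSearchCV`. The KNN model, parameter grid, and Iris dataset below are illustrative choices:

```python
# A minimal grid-search sketch: exhaustive search over a small
# hyperparameter grid, with 5-fold cross-validation built in.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)                       # winning hyperparameters
print(f"best CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is already refit on the full dataset with the winning hyperparameters.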