Originally posted on Twitter.
Titanic is tiring 🚢
Iris is irritating
MNIST is too easy 🥱
Boston makes me queasy 🤢
California housing is not so bad 🏡
Sentiment analysis just makes me sad 🥲
Here are the datasets that I gravitate to... 🧵
What about you?
For an introduction to ML so gently
The Palmer Penguins are nice and friendly
It has nice illustrations too 👩‍🎨
Meet the Chinstrap, Adelie and Gentoo 🐧
Now if you want something a bit grander 🗽
The New York Taxi has some splendor
Does this go beyond your RAM?
Analyze it out-of-core if you can
The MNIST dataset might be passé
Roman numerals are drawn another way
A bit of noise and style imbalance
Are sure to give you a nice challenge
An important part of any ML course 👨‍🏫
Is to consider the model's source
Even famous datasets contain mistakes
Finding them can be a nice chase
Some datasets we cannot see
Powering models like BERT, the friend of Ernie 📺
We can question it with clever tasks
Take a look behind the word masks 🎭
That bias is not a bug 🪳
Don't sweep it under the rug 🧹
This topic deserves a reminder
These systems should be kinder 🤗
Sources
- Palmer Penguins: https://allisonhorst.github.io/palmerpenguins/articles/intro.html
- New York Taxi: https://github.com/vaexio/vaex-talks/blob/master/2019-pydata-london/PyData-London-2019-vaex-EDA-ML.ipynb
- Roman Numerals: https://https-deeplearning-ai.github.io/data-centric-comp/
- Dataset Errors: https://labelerrors.com
- Hugging Face LLM Course: https://huggingface.co/learn/llm-course/chapter1/8?fw=pt