📊 Datasets poem

January 28, 2022

Originally posted on Twitter.

Titanic is tiring 🚢

Iris is irritating 🥀

MNIST is too easy 🥱

Boston makes me queasy 🤢

California housing is not so bad 🏡

Sentiment analysis just makes me sad 🥲

Here are the datasets that I gravitate to... 🧵

What about you? 🙌

For an introduction to ML so gently 👼

The Palmer Penguins are nice and friendly 💞

It has nice illustrations too 👩‍🎨

Meet the Chinstrap, Adelie and Gentoo 🐧

Now if you want something a bit grander 🗽

The New York Taxi has some splendor 🚕

Does this go beyond your RAM? 🙅

Analyze it out-of-core if you can 🚀

The MNIST dataset might be passé 💾

Roman numerals are drawn another way ✍️

A bit of noise and style imbalance ⚖️

Are sure to give you a nice challenge 🏆

An important part of any ML course 👨‍🏫

Is to consider the model's source 📰

Even famous datasets contain mistakes ❎

Finding them can be a nice chase 🐆

Some datasets we cannot see 👀

Powering models like BERT, the friend of Ernie 📺

We can question it with clever tasks 🛠

Take a look behind the word masks 🎭

That bias is not a bug 🪳

Don't sweep it under the rug 🧹

This topic deserves a reminder 📆

These systems should be kinder 🤗

Sources

  1. Palmer Penguins: https://allisonhorst.github.io/palmerpenguins/articles/intro.html
  2. New York Taxi: https://github.com/vaexio/vaex-talks/blob/master/2019-pydata-london/PyData-London-2019-vaex-EDA-ML.ipynb
  3. Roman Numerals: https://https-deeplearning-ai.github.io/data-centric-comp/
  4. Dataset Errors: https://labelerrors.com
  5. Hugging Face LLM Course: https://huggingface.co/learn/llm-course/chapter1/8?fw=pt