Notes about the book The Kaggle Book
- Chapter 1 - Introducing Kaggle and Other Data Science Competitions
- Chapter 2 - Organizing Data with Datasets
- Chapter 3 - Working and Learning with Kaggle Notebooks
- Chapter 4 - Leveraging Discussion Forums
- Kaggle public API docs.
- Kaggle API GitHub repo.
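- A minimal sketch of the official Python client from the docs/repo above (assuming `pip install kaggle` and an API token at `~/.kaggle/kaggle.json`):

  ```python
  from kaggle.api.kaggle_api_extended import KaggleApi

  # Authenticate using the kaggle.json token.
  api = KaggleApi()
  api.authenticate()

  # List active competitions and download a public dataset.
  print(api.competitions_list())
  api.dataset_download_files(
      'bricevergnou/spotify-recommendation', path='.', unzip=True
  )
  ```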
- Tip: when enrolled in a competition, interact with others on the discussion forum to share and learn.
- Common Task Framework (CTF): great for advancing state-of-the-art solutions. Its ingredients:
- Well-defined metrics and quality data
- Competition
- Sharing between competitors
- Compute-resource availability
- What can go wrong in a competition:
- Leakage from the data: the data contains information about the target that is not available at prediction time.
- Probing from the leaderboard: using leaderboard feedback as the metric to tune your solution.
- Overfitting and consequent leaderboard shake-up: cases with a huge gap between performance on the training set and on the public test set.
- A technique to measure discrepancies between the training and test sets (adversarial validation): https://www.kaggle.com/code/tunguz/adversarial-ieee/notebook (see the sketch after this list).
- Private sharing
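- A minimal sketch of that adversarial validation idea (not the linked notebook's exact code; it assumes numeric features and hypothetical `train.csv`/`test.csv` files):

  ```python
  import numpy as np
  import pandas as pd
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  train = pd.read_csv('train.csv')
  test = pd.read_csv('test.csv')

  # Use only the features both sets share (i.e., drop the target).
  features = [c for c in test.columns if c in train.columns]

  # Label each row by its origin: 0 = train, 1 = test.
  X = pd.concat([train[features], test[features]], axis=0)
  y = np.r_[np.zeros(len(train)), np.ones(len(test))]

  # If a classifier can tell the two sets apart (AUC well above 0.5),
  # their distributions differ and a shake-up is more likely.
  clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
  auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
  print(f'Adversarial validation AUC: {auc:.3f}')
  ```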
- Jeremy Howard on how to set yourself up for success on Kaggle: https://www.kaggle.com/code/jhoward/first-steps-road-to-the-top-part-1
- It is possible to upload a dataset either privately or publicly.
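- A minimal sketch of uploading via the Python client (assumes the `kaggle` package; `my-dataset-folder` is a hypothetical folder holding your files plus a `dataset-metadata.json`):

  ```python
  from kaggle.api.kaggle_api_extended import KaggleApi

  api = KaggleApi()
  api.authenticate()

  # public=False keeps the dataset private;
  # public=True publishes it for everyone.
  api.dataset_create_new('my-dataset-folder', public=False)
  ```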
- It is possible to use the "Import a GitHub repository" option to import an experimental library not yet available on Kaggle Notebooks.
- Interesting interview with Larxel:
- On creating datasets:
- All in all, the process that I recommend starts with setting your purpose, breaking it down into objectives and topics, formulating questions to fulfil these topics, surveying possible sources of data, selecting and gathering, pre-processing, documenting, publishing, maintaining and supporting, and finally, improvement actions.
- On learning on Kaggle:
- Absorbing all the knowledge at the end of a competition
- Replication of winning solutions in finished competitions
- The easiest way to work with Kaggle datasets is by creating a notebook from the dataset webpage.
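- Inside a Kaggle Notebook, attached datasets are mounted read-only under `/kaggle/input/`; a minimal sketch to see what is available:

  ```python
  import os

  # Walk the input directory and print every attached file.
  for dirname, _, filenames in os.walk('/kaggle/input'):
      for filename in filenames:
          print(os.path.join(dirname, filename))
  ```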
- This section contains a step-by-step guide to downloading Kaggle Datasets into Colab:
  - Download the Kaggle API token (`kaggle.json`) from your Kaggle account page (on a local machine it would live at `~/.kaggle/kaggle.json`).
  - Create a folder named `Kaggle` on your Google Drive and upload the `.json` there.
  - Mount Google Drive in your Colab notebook:

    ```python
    from google.colab import drive

    drive.mount('/content/gdrive')
    ```

  - Provide the path to the `.json` config:

    ```python
    import os

    # /content/gdrive/My Drive/Kaggle is the path where kaggle.json is
    # present in the Google Drive
    os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"

    # change the working directory
    %cd /content/gdrive/My Drive/Kaggle
    ```

  - Go to the dataset page and use the "Copy API command" option.
  - Run the command in Colab, e.g.:

    ```python
    !kaggle datasets download -d bricevergnou/spotify-recommendation
    ```

  - The data is downloaded to `os.environ['KAGGLE_CONFIG_DIR']`. Unzip it and you are ready to go.
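- For that final unzip step, a minimal sketch (assuming the archive name matches the dataset slug from the example above):

  ```python
  import zipfile

  # kaggle datasets download saves the data as <slug>.zip
  # in the working directory.
  with zipfile.ZipFile('spotify-recommendation.zip') as zf:
      zf.extractall('spotify-recommendation')
  ```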
- Ways to create a notebook:
- Create from scratch
- Create from a dataset
- Fork from a notebook.
- It is good manners to upvote the original notebook and give credit if you build your solution on top of it.
- You can either run the notebook interactively or save and run specific versions.
- It is possible to enable automatic sync between Kaggle Notebooks and GitHub.
- File -> Link to GitHub
- Use your free GPU/TPU wisely
- The hours start counting as soon as you start your notebook.
- Keep the GPU disabled while you write your code, check the syntax, and run on a small subset of the data to check for errors (see the sketch below).
- Once it is ready to run, change the runtime accelerator.
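- A minimal sketch of that debug-first workflow (the dataset path is hypothetical):

  ```python
  import pandas as pd

  DEBUG = True  # flip to False once the code runs end to end

  # Hypothetical dataset path inside a Kaggle Notebook.
  df = pd.read_csv('/kaggle/input/my-dataset/train.csv')

  if DEBUG:
      # Develop on a small CPU-friendly sample to catch errors
      # before enabling the GPU/TPU accelerator.
      df = df.sample(n=1_000, random_state=0)
  ```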
- Sometimes Kaggle Notebooks are not enough, e.g., when the data exceeds 100 GB. Options:
- Downsample the data -> may hurt performance.
- Use Kaggle Datasets on Google Colab as shown in chapter 2.
- Upgrade your Kaggle Notebook to use Google Cloud AI Notebooks. It is not free.
- Nice interview with heads or tails, who specializes in EDA.
- Kaggle has a list of courses to learn about specific topics.
- Nice interview with Andrada Olteanu about how to use notebooks to learn.
- Nothing really noteworthy in this chapter: just use the discussion forums to share and learn from other Kagglers.