Open science and machine learning competitions for psychiatry

Competition drives innovation.

Data science and machine learning competitions have become increasingly popular since the introduction of Kaggle.

What a typical data science competition is:

It can be compared to outsourcing a given data science problem to the community.

The first step is to choose a problem and describe it in a way that a broad community will understand. It is important to motivate participants in order to attract as many as possible. Apart from an interesting problem with a good description, prize money is typically offered in data science competitions. The prize for first place usually ranges from $1K to $1M.

Data science competitions are built around data. It is important to realize how much data must be prepared and how it should be cleaned, and care must be taken that the data do not contain any confidential, classified, or sensitive information. The data must be open.

As part of the competition preparation, the data is usually split into:

Public dataset, available to participants for training and testing their solutions.
The public leaderboard is determined based on this dataset.
Evaluation dataset, which is hidden from the participants and used to determine the competition winners. This allows for an objective comparison.
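
The split described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used on any particular platform: the function name split_public_evaluation, the 30% evaluation fraction, and the toy records are all assumptions made for the example. It stratifies by label so that both parts keep similar class proportions.

```python
import random
from collections import defaultdict


def split_public_evaluation(records, label_key, evaluation_fraction=0.3, seed=0):
    """Split records into (public, evaluation) lists, stratified by label.

    Hypothetical helper for illustration; the fraction and seed are assumptions.
    """
    if not 0.0 < evaluation_fraction < 1.0:
        raise ValueError("evaluation_fraction must be strictly between 0 and 1")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group records by label so each class is split in the same proportion.
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    public, evaluation = [], []
    for label in sorted(by_label, key=repr):
        group = list(by_label[label])
        rng.shuffle(group)
        n_eval = round(len(group) * evaluation_fraction)
        evaluation.extend(group[:n_eval])  # hidden from participants
        public.extend(group[n_eval:])      # released to participants

    rng.shuffle(public)
    rng.shuffle(evaluation)
    return public, evaluation


# Toy data: 100 records with two balanced classes (made up for the example).
records = [{"id": i, "label": i % 2} for i in range(100)]
public, evaluation = split_public_evaluation(records, "label", 0.3, seed=42)
```

In practice the split often has to respect more than the label, for example keeping all records from one person in the same part, which this sketch does not attempt.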

It is often the case that the top solutions on the public leaderboard are not the final winners of the competition. They may be overfitted to the public dataset.

Therefore, the evaluation dataset must be prepared carefully, so that it rewards solutions of genuinely high quality and prevents overfitted solutions from winning. It must differ enough from the public dataset, but not too much, because otherwise the solutions will not be able to generalize at all.
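
How a public ranking can differ from the final one is easy to show with a toy example. The sketch below is illustrative only: accuracy is assumed as the metric, and the team names, labels, and predictions are made up. One hypothetical team scores perfectly on the public data yet loses on the hidden evaluation data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_submissions(submissions, y_true):
    """Return (team, score) pairs sorted from best to worst score."""
    scores = {team: accuracy(y_true, preds) for team, preds in submissions.items()}
    # Sort by descending score; ties are broken alphabetically by team name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up ground truth for the public and the hidden evaluation datasets.
public_truth = [1, 0, 1, 1, 0, 0, 1, 0]
evaluation_truth = [0, 1, 1, 0, 1, 0, 0, 1]

# Made-up predictions from two hypothetical teams.
public_predictions = {
    "team_a": [1, 0, 1, 1, 0, 0, 1, 0],  # fits the public data perfectly
    "team_b": [1, 0, 1, 0, 0, 0, 1, 1],
}
evaluation_predictions = {
    "team_a": [1, 1, 0, 0, 1, 1, 0, 0],  # does not carry over to unseen data
    "team_b": [0, 1, 1, 0, 1, 0, 1, 1],
}

public_ranking = rank_submissions(public_predictions, public_truth)
final_ranking = rank_submissions(evaluation_predictions, evaluation_truth)

print("Public leaderboard:", public_ranking)
print("Final ranking:", final_ranking)
```

Here team_a leads the public leaderboard while team_b wins on the evaluation data, which is the situation a well-prepared evaluation dataset is meant to expose.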

Advantages of turning problems into competitions:

The effect of scale: a large group of independent participants contributes.
The top solution is often of top quality.
Competitions help to validate the given idea and check for potential pitfalls such as data leaks. If such pitfalls exist, it is likely that some participants will discover them.
Non-standard, unorthodox, and very interesting solutions are often found among the submissions.
Networking: people interested in the competition topic can get in touch, e.g., on the competition forum or at post-competition events.
Promotion for the competition organizer and their topic.
Possible new research topics and results for scientists and students.
For example, a competition may enable interesting master's theses.

[A. Janusz, T. Tajmajer, M. Świechowski, Ł. Grad, J. Puczniewski and D. Ślęzak, “Toward an Intelligent HS Deck Advisor: Lessons Learned from AAIA’18 Data Mining Competition,” 2018 Federated Conference on Computer Science and Information Systems (FedCSIS), Poznan, Poland, 2018, pp. 189-192.]

Our aim is to promote the use of data science competitions as tools for boosting research in psychiatry. In addition, we will:

Help you organize a data science competition related to psychiatry on the KnowledgePit platform (https://knowledgepit.ml)
Validate your idea for the competition
Audit your data for its potential as the basis of a competition
Help with cleaning, preprocessing, and anonymizing data
Help with splitting data into public and evaluation datasets
Help in designing a baseline model
Find areas in psychiatry that are suitable for data science and machine learning competitions
Our future goal is to host competitions as the ITP Foundation as well

Get involved:

Do not hesitate to contact us if you have data or even an idea for a competition focused on a problem significant to psychiatry.