The novel context of Big Data has demonstrated that classical relational databases are not suitable: novel platforms for managing an incredible variety of datasets have become necessary, as demonstrated by the popularity of “data lakes” and “data lakehouses”.
One common issue of modern data platforms is to detect pairs of datasets that concern the same topic. However, a matching that is purely syntactic is not effective: the exploitation of modern AI techniques for Natural-Language Processing, such as word embedding and sentence embedding, promise to address the issue in a (more or less) semantic way.
The contribution of the paper is a novel methodology (called “TopicRank”) for flexible querying data platforms, so as to find out pairs of datasets that concern the same topic, on the basis of the textual description that accompany datasets as meta-data. The paper presents the results of a preliminary experiment that was conducted on a real pool of datasets.