An original template solution for FAIR scientific text mining

The software is plug-and-play. Orange can be run via a visual editor or accessed as a Python library. Yet for many academics, R is the preferred language due to its strong statistics and visualization libraries. Our methodology is simpler to use and more customizable than any plug-and-play software, especially for those with an academic background and a basic understanding of R. The text mining methodology presented in this paper and the accompanying code are generalized to provide insights into a wide range of scientific topics. The provided code can be used both as a template for scientific text mining and as educational material. The code in this paper is open-source and made available on GitHub under the Apache V2 license [6]. The next sections describe the process of collecting data for text mining and the actual steps of text mining.

Searching and collecting literature data
Before the text mining begins, the researcher needs to search and collect articles based on pre-defined search criteria. For a text mining analysis to adhere to the Findable, Accessible, Interoperable, and Reusable (FAIR) principles [7], the researcher needs to provide a) the reference to the download source, b) the search criteria, and c) the exact timeframe for which published articles were retrieved. For maximum reproducibility, a supplementary file with a list of the DOIs of the articles or abstracts used, together with the code, should be provided.
There are multiple ways to semi-automatically or fully automatically download articles or abstracts for text mining from different literature databases. First is the manual retrieval of abstracts based on various keyword criteria using the ScienceDirect advanced search function. Article abstracts and metadata can be downloaded as BibTeX, 100 at a time. The script [1_BiTex2TSVfile.py] can be used to convert BibTeX citations to a Tab Separated Values (TSV) file that can be loaded for analysis. Second is the automatic retrieval of abstracts and metadata using the Arcas Python library. Arcas can scrape article abstracts and metadata from the IEEE, PLOS, Nature Springer, and arXiv APIs at large scale [9]. Example code for accessing these journal and preprint publisher APIs using Arcas is provided in our template code [1_Scrape_articles_arcas.py]. Third is the retrieval of full-text articles as PDFs. If you are affiliated with an organization that has a subscription to Elsevier, you can download as many full articles as desired using the ScienceDirect API; see the Elsevier Developer Portal [8].

Text Mining Steps
Once the literature data has been collected, the next step is to conduct the text mining. The three steps for text mining are: (1) loading & pre-processing of data, (2) processing, and (3) analysis of results & visualizations. Each step is explained below.

Loading & pre-processing of data
Text mining data can be either in the form of PDFs or a tab separated (TSV) file that contains article abstracts and metadata. If the data consist of PDFs, the R pdftools library can be used to load all article PDFs directly into R using our template code. The same applies to a TSV file.
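As an illustration, the snippet below is a minimal sketch of both loading routes; the folder name, file names, and column name are placeholders for this example and are not taken from the template code.

```r
# Minimal sketch of loading text data into R; all paths/names are placeholders.
library(pdftools)

# Option 1: load full-text articles from PDFs (one string per article)
pdf_files <- list.files("articles_pdf", pattern = "\\.pdf$", full.names = TRUE)
texts <- sapply(pdf_files, function(f) paste(pdf_text(f), collapse = " "))

# Option 2: load abstracts and metadata from a tab separated file
articles <- read.delim("abstracts.tsv", stringsAsFactors = FALSE)
texts <- articles$abstract  # assumes the file has an "abstract" column
```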
Our template solution uses the tm package for further analysis. Before loading text into a text corpus using tm, the type of analysis to be performed must be taken into consideration. It is useful to first perform a text mining analysis with all text loaded as a single text corpus to explore the data, to become familiar with the code, and to better understand the literature. The template code for the analysis of a single text corpus is provided [Text_Mining_single_document_group_1-3A.R]. In this first analysis of the text, the frequency of terms is calculated and the correlation between terms is shown, including the automatic detection of groups of terms based on their clustering. In most cases, this analysis is sufficient to achieve the researcher's text mining objectives. For more advanced analysis, such as text mining based on groups of articles, groups of terms, or combined groups of documents and terms, a template solution is also provided [Text_Mining_multiple_document_groups_1-3B.R]. Lastly, we provide template code specifically for time-based text mining analysis [Text_Mining_time_based_analysis_1-3C.R].
The next step after loading text as a text corpus object is to create the Term Document Matrix (TDM) using the tm TermDocumentMatrix() function. As shown in the example code, various parameters can be used to clean the text: removing stop-words, punctuation, and numbers, converting to lower case, and reducing words to their stems. It is advisable to properly clean text before running any analysis. A TDM based on texts of various lengths, such as articles, is not suitable for direct analysis because of a) differences in word counts due to differences in text length and b) the writing style of the author. Before starting the analysis, the data therefore need to be pre-processed. The data can be normalized by converting the TDM to a binary matrix, which can be used as input for either a correlation analysis or for calculating word frequencies. Additionally, by converting the TDM to present/absent values, binomial statistics can be used to calculate the confidence interval of terms that are present and absent in groups of documents.
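The sketch below shows how such a cleaned, binarized TDM could be built with tm; it assumes the `texts` vector from the previous sketch and is an illustration, not the template code itself.

```r
# Sketch: build a cleaned Term Document Matrix with tm and binarize it.
library(tm)

corpus <- VCorpus(VectorSource(texts))
tdm <- TermDocumentMatrix(corpus, control = list(
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  tolower           = TRUE,
  stopwords         = TRUE,
  stemming          = TRUE   # stemming requires the SnowballC package
))

# Normalize to a binary (present/absent) matrix to remove text-length effects
m <- as.matrix(tdm)
m_binary <- (m > 0) * 1     # terms in rows, documents in columns
```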
If text mining based on a single text corpus is not fine-grained enough, the researcher can continue with text mining involving multiple categories or groups, or perform a time-based analysis. The main difference between using multiple categories and using a single text corpus is that each group of articles is loaded into a separate text corpus. For the classification of articles, the researcher can choose between a) manual supervised classification and b) unsupervised classification. An example of manual classification would be separating articles based on their metadata, such as the author, the year of publication, a specific topic, or the presence of a keyword. An example of unsupervised classification would be to use K-means clustering to determine the optimal number of clusters/groups into which to segregate the articles, as sketched below. In addition, the researcher can use supervised or unsupervised groups of words to analyze a single document or a group of documents. Tolentino-Zondervan & Zondervan [2] found manual classification of both documents and terms most useful and well suited when working with a small to medium number of articles, and very useful when combined with in-depth literature review.
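A minimal sketch of such an unsupervised grouping, assuming the binary matrix `m_binary` from before; the number of clusters is illustrative, and in practice it would be chosen with, for example, an elbow plot.

```r
# Sketch: unsupervised classification of documents with k-means on the binary TDM.
dtm_binary <- t(m_binary)             # documents as rows, terms as columns
set.seed(42)                          # make the clustering reproducible
km <- kmeans(dtm_binary, centers = 3) # k = 3 is illustrative, not prescribed
doc_groups <- km$cluster              # one cluster label per document
```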

Processing
To gain quantitative understanding, the binary TDM is used to count the number of documents that contain a term. This count can be made for either a single group or multiple groups of documents, which can also be time series data. Since these counts follow a binomial distribution, we use binomial statistics to calculate the mean and standard deviation of the relative word frequency, the confidence interval, and the significance of groups being different based on a two-sided binomial test. Based on the term frequency, we can drop rare and common words using a cut-off value for the minimal relative word frequency and a cut-off value for the maximum relative word frequency. The term frequency is the first main processed data type used to gain insights in our text mining methodology.
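In code, this could look as follows; the cut-off values and the example counts are illustrative, not taken from the template code.

```r
# Sketch: binomial statistics on the binary TDM (terms in rows, documents in columns).
term_doc_counts <- rowSums(m_binary)   # number of documents containing each term
n_docs <- ncol(m_binary)
rel_freq <- term_doc_counts / n_docs   # relative term frequency

# Drop rare and common terms using illustrative cut-off values
keep <- rel_freq > 0.05 & rel_freq < 0.80
m_filtered <- m_binary[keep, ]

# 95% confidence interval for a term present in 25 out of 100 documents
binom.test(25, 100)$conf.int
```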
The second processed data type used in our final analyses is the Term-Term Matrix (TTM), which we generate from the binary TDM. The TTM counts the number of co-occurrences of terms in documents and can be used to show the correlation between terms. We generate heatmaps, dendrograms, and networks based on the Pearson correlation coefficient for the terms in the TTM. In our analysis we use the Pearson correlation since it is invariant to both scaling and offset. The term frequency, the fold-change in relative term frequency between groups of documents, as well as the significance of these differences can be used as criteria to identify terms of interest. Additionally, manual grouping of documents and terms can be used in our methodology. Code examples for both manual and automated grouping are provided. Our methodology is suitable for both qualitative and quantitative analysis and is very effective when combined with traditional literature review.
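A sketch of both computations, reusing the filtered binary matrix `m_filtered` from above:

```r
# Sketch: Term-Term Matrix and Pearson correlations between terms.
ttm <- m_filtered %*% t(m_filtered)  # co-occurrence counts per term pair
term_cor <- cor(t(m_filtered))       # Pearson correlation between terms across documents

# Correlation heatmap with clustering dendrograms on both axes (base R)
heatmap(term_cor, symm = TRUE)
```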

Analysis of results & visualizations
As shown in the graphical abstract, there are multiple types of analysis that can be performed using text mining. Our methodology provides code templates for three main types of analysis. The first type is the analysis of a single text corpus (see Graphical Abstract 3A) of documents such as articles or abstracts. Firstly, the researcher can correlate either documents or terms in articles based on the binary TDM. Secondly, the relative word frequency can be calculated from the binary TDM. The relative frequency of words indicates how important they are in the analyzed text and provides a first, easy-to-comprehend quantitative insight into the analyzed documents.
The second type of analysis is the analysis of multiple categories of articles and/or analysis based on multiple categories of terms (see Graphical Abstract 3B). The simplest form of comparing multiple categories is a pairwise analysis of two groups. An example of a comparison of multiple groups of articles on multiple groups of terms can be found in the paper of Tolentino-Zondervan & Zondervan [2]. Since our data contain multiple groups, binomial statistics can be used to show the confidence interval of terms in different groups of documents, or to select those terms that differ most significantly between groups.
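As a hedged sketch of such a pairwise comparison for a single term, with hypothetical counts:

```r
# Sketch: two-sided binomial test of one term between two document groups.
count_A <- 18; n_A <- 60   # documents in group A containing the term (hypothetical)
count_B <- 9;  n_B <- 75   # documents in group B containing the term (hypothetical)

# Does group A's relative frequency differ from group B's proportion?
binom.test(count_A, n_A, p = count_B / n_B, alternative = "two.sided")
```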
The third type of analysis, often used to identify trends, is the time-based analysis (see Graphical Abstract 3C). Time-based analysis can be extended with linear or non-linear regression to forecast trends based on relative keyword abundance. Which type of regression or forecasting should be used depends on the observed trend. For example, if seasonality is detected, the researcher should correct for it. A logical choice for forecasting trends is the R forecast library combined with the R MLmetrics library, scoring the forecast model using the Mean Absolute Percentage Error (MAPE).
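The sketch below illustrates this combination on mock yearly relative frequencies; the numbers and the choice of auto.arima() as the model are illustrative only.

```r
# Sketch: forecast a term's yearly relative frequency and score it with MAPE.
library(forecast)
library(MLmetrics)

freq  <- ts(c(0.10, 0.12, 0.15, 0.14, 0.18, 0.21, 0.24, 0.27), start = 2015)  # mock data
train <- window(freq, end = 2020)
test  <- window(freq, start = 2021)

fit <- auto.arima(train)               # let forecast select a model
fc  <- forecast(fit, h = length(test)) # forecast the held-out years
MAPE(y_pred = as.numeric(fc$mean), y_true = as.numeric(test))
```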

Types of visualizations
In this section we discuss which types of visualizations our text mining methodology produces and when to use them. In addition, we provide examples of figures, summarized in Fig. 1, which are based on the three peer-reviewed articles. Note that all these examples and more can be generated by the template code and mock data provided with this paper.
The choice of visualization depends mainly on the type of research question a researcher wants to address. To best understand which visualization is suitable, we first look at the properties of the processed data. The information we produce with our text mining methodology is a) the correlation between terms, b) the term frequency, c) the term group, d) the document group, and e) the binomial statistics.
The simplest visualization to explore text data is a word cloud (see Fig. 1A), which is often used for presentations and visualizes only the term frequency data. In the analysis of a single group of documents, we have the combination of a) term correlation data, b) frequency data, and, optionally, c) term groups based on hierarchical clustering. These three data types can be combined and visualized in a heatmap (see Fig. 1B) or a network graph (see Fig. 1C). A heatmap shows the correlation between terms, with the clustering of terms in the axis dendrograms. Optionally, term grouping and frequency can be shown as row annotations in heatmaps. In a network graph, terms are represented by nodes and the correlations between terms by edges. The node size can be made proportional to the term frequency, and the color of nodes can be based on the grouping of terms.
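A sketch of both visualizations, assuming the wordcloud and igraph packages (common choices for these plot types, not necessarily the ones in Table 1) and reusing `rel_freq`, `term_doc_counts`, and `term_cor` from the earlier sketches:

```r
# Sketch: word cloud of term frequencies
library(wordcloud)
wordcloud(words = names(rel_freq), freq = term_doc_counts, min.freq = 2)

# Sketch: network graph of term correlations
library(igraph)
adj <- term_cor
adj[abs(adj) < 0.5] <- 0               # keep only strong correlations (illustrative cut-off)
g <- graph_from_adjacency_matrix(adj, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
V(g)$size <- rel_freq[V(g)$name] * 50  # node size proportional to term frequency
plot(g)
```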
When analyzing multiple groups of documents, binomial statistics can be performed to provide the confidence interval and the p-value (for the relative frequency between groups not being the same). Bar graphs can show terms selected based on their fold-change or the significance of their differences between groups. Bar plots can also show the confidence interval for the term frequency in a group (see Fig. 1E). An example of a pairwise comparison of documents from two groups (two time periods) is the bar graph with confidence intervals presented in Zondervan & Tolentino-Zondervan et al., Fig. 3 [1]. Bubble plots are a good alternative to line plots when dealing with many terms, since they are more organized and can show various properties through their size or color (see Fig. 1D). Bubble plots also work well when performing a time-based analysis. Alternatively, individual line plots with a trendline and confidence interval are well suited for a time-based analysis (see Fig. 1F).
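One way such a bar graph with error bars could be drawn is sketched below with ggplot2 (an assumption on our part, not stated in the text); the data frame holds mock frequencies and confidence bounds.

```r
# Sketch: bar graph of term frequencies per group with binomial confidence intervals.
library(ggplot2)

df <- data.frame(                      # mock data, for illustration only
  term  = rep(c("termA", "termB"), each = 2),
  group = rep(c("2010-2015", "2016-2021"), times = 2),
  freq  = c(0.30, 0.45, 0.20, 0.10),
  lower = c(0.22, 0.36, 0.14, 0.05),
  upper = c(0.39, 0.54, 0.28, 0.17)
)

ggplot(df, aes(x = term, y = freq, fill = group)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
                position = position_dodge(width = 0.9), width = 0.2)
```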
For maximal reproducibility, we provide in Table 1 a list of the specific versions of R and the libraries we used when testing our code.

Fig. 1. A) Word cloud, B) heatmap of term correlations, C) network graph of term correlations, D) bubble plot of term frequency over time, E) bar graph of proportional term frequency with error bars, F) individual line plots with trend lines and confidence intervals.

Table 1
List of dependencies.