A Study on Trends In Information Technologies using Big Data Analytics

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original paper viewer below or the original arXiv source.

We live in an information era: from Twitter to Fitocracy, every episode of people's lives is converted into numbers. That abundance of data also extends to information technologies: from Stack Overflow to GitHub, many big data sources describe trends in the field. The aim of this research is to study information technology trends and to compile useful information about those technologies using the big data sources mentioned above. The collected information may help decision makers and information technology professionals decide where to invest their time and money. In this research we mined and analyzed StackExchange and GitHub data to create meaningful predictions about information technologies. Initially, StackExchange and GitHub data were imported into local data repositories. After import, cleaning and preprocessing techniques such as tokenization, stemming, and dimensionality reduction were applied to the data. Keywords and their relations were then extracted. From those keywords, four main knowledge areas and their variations, i.e., 20 programming languages, 8 database applications, 4 cloud services, and 3 mobile operating systems, were selected for trend analysis. The extracted patterns were used for cluster analysis in Gephi, and the resulting graphs supported exploratory analysis of the programming-language data. After exploratory analysis, usage time series were created for the selected keywords and used as training and testing data for forecasts built with the R forecast library. Forecast accuracy was then tested using the Mean Magnitude of Relative Error and the Median Magnitude of Relative Error.


💡 Research Summary

The paper investigates how large‑scale public data from developer communities can be leveraged to identify and forecast trends in information technologies. The authors focus on two major sources: StackExchange (primarily Stack Overflow) and GitHub. Data spanning 2010 to 2023 were extracted via the respective APIs, encompassing questions, answers, repository metadata, commit messages, and related textual content. After storage in local relational databases, a multi‑step preprocessing pipeline was applied: textual fields were tokenized, stop‑words removed, and stemming performed; terms were weighted with term frequency‑inverse document frequency (TF‑IDF); and dimensionality was reduced through Latent Semantic Analysis (LSA), limiting the feature space to a few hundred latent topics. Noise such as duplicate posts, spam repositories, and inactive projects was filtered out.
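The preprocessing steps above (tokenization, stop-word removal, stemming, TF-IDF weighting) can be sketched in plain Python. This is a minimal illustrative stand-in, not the paper's actual pipeline: the stop-word list, the toy suffix-stripping stemmer, and the sample documents are all made up, and a real pipeline would use a Porter stemmer and follow the TF-IDF step with truncated SVD to perform the LSA reduction.

```python
import math
import re
from collections import Counter

# Toy stop-word list; a real pipeline would use a full list (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "is", "in", "to", "of", "and", "with", "how", "do", "i"}

def preprocess(text):
    """Tokenize, drop stop-words, and apply a naive suffix-stripping stemmer."""
    # [a-z+#]+ keeps tokens like "c++" and "c#" intact, useful for language keywords.
    tokens = re.findall(r"[a-z+#]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Toy stemmer: strip a few common suffixes from longer words only.
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

def tfidf(docs):
    """Return one {term: weight} dict per document, using raw tf * log(n/df).

    Terms that occur in every document get weight 0 under this idf variant.
    LSA would then factor the resulting term-document matrix with truncated SVD.
    """
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in tokenized
    ]

# Hypothetical Stack Overflow-style titles, for illustration only.
docs = [
    "How do I install python packages",
    "Installing python modules with pip",
    "Java memory errors in production",
]
weights = tfidf(docs)
```

Note that stemming maps "install" and "installing" to the same term, so the first two documents share weight on it, while "java" receives the maximum idf because it occurs in only one document.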

From the cleaned corpus, the authors extracted keywords representing four “knowledge areas”: 20 programming languages, 8 database technologies, 4 cloud services, and 3 mobile operating systems. Keyword selection combined frequency thresholds with association‑rule mining (minimum support and confidence criteria). The resulting keyword sets were then analyzed for co‑occurrence relationships using Gephi. Nodes represented keywords, edges reflected joint appearance within the same document, and community detection (Louvain algorithm) identified clusters that corresponded to logical technology groupings (e.g., a Python‑DataScience cluster versus a Java‑Enterprise cluster). Visualizations provided an exploratory view of how technologies are interlinked in the developer discourse.
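The graph construction above can be sketched as follows. The snippet builds the weighted co-occurrence edge list that a tool like Gephi would consume; as a crude stand-in for the Louvain community detection the paper runs in Gephi, it groups nodes by connected components. The keyword sets are hypothetical examples, not the paper's data.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(docs):
    """Count how often two keywords appear in the same document.

    The weighted edge list can be exported (e.g. as CSV) and loaded into
    Gephi, where Louvain community detection finds the clusters.
    """
    edges = Counter()
    for keywords in docs:
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return edges

def components(edges):
    """Crude stand-in for community detection: connected components (union-find)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Hypothetical per-document keyword sets, e.g. tags extracted from posts.
docs = [
    {"python", "pandas", "numpy"},
    {"python", "numpy"},
    {"java", "spring", "hibernate"},
]
edges = cooccurrence_edges(docs)
clusters = components(edges)
```

On this toy input the edge (numpy, python) gets weight 2, and the graph splits into a Python data-science cluster and a Java enterprise cluster, mirroring the kind of grouping the paper reports.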

For each keyword, annual occurrence counts were aggregated to form time‑series spanning 14 years. These series served as input to forecasting models built with R’s forecast package. Both ARIMA and exponential smoothing (ETS) models were fitted; model selection relied on Akaike and Bayesian information criteria. The chosen models generated forecasts for the period 2024‑2028. Forecast accuracy was evaluated using Mean Magnitude of Relative Error (MMRE) and Median Magnitude of Relative Error (Median MRE). Across all keywords, the average MMRE was 12.4 % and the median MRE 9.8 %, indicating reasonable predictive performance, especially for cloud services where errors were lower, while mobile operating systems exhibited higher volatility and larger errors.
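The paper fits ARIMA and ETS models with R's forecast package; purely for illustration, the sketch below substitutes a minimal simple-exponential-smoothing forecast in Python and shows how the MMRE and Median MRE accuracy metrics are computed. The training series, smoothing parameter, and horizon are all made-up values, not the paper's data.

```python
import statistics

def ses_forecast(series, alpha=0.5, horizon=1):
    """Simple exponential smoothing; a toy stand-in for forecast::ets/auto.arima.

    SES produces a flat forecast equal to the final smoothed level.
    """
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: mean of |actual - pred| / actual."""
    errs = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(errs) / len(errs)

def mdmre(actual, predicted):
    """Median Magnitude of Relative Error: median of |actual - pred| / actual."""
    return statistics.median(abs(a - p) / a for a, p in zip(actual, predicted))

# Hypothetical annual keyword counts: train on early years, hold out the rest.
train = [100, 120, 130, 150, 160]
test = [170, 175]
preds = ses_forecast(train, alpha=0.5, horizon=2)
error = mmre(test, preds)
```

Because SES forecasts are flat, it underpredicts this rising series; evaluating held-out years with MMRE/Median MRE, as the paper does, quantifies exactly that kind of bias.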

The study concludes that big‑data analytics of developer‑generated content can provide actionable insights for decision‑makers. Notable findings include the sustained rise of Python and data‑science‑related libraries, a gradual shift from traditional relational databases toward NoSQL solutions, and increasing interest in multi‑cloud strategies. The authors acknowledge limitations: reliance on only two data sources introduces a bias toward open‑source and developer‑centric perspectives; methodological details such as exact dimensionality‑reduction parameters, association‑rule thresholds, and cross‑validation procedures are insufficiently described; and no external benchmark (e.g., market research reports) was used for validation. Future work is proposed to incorporate additional streams such as Twitter, LinkedIn, and patent databases, to develop real‑time pipelines, and to explore more robust evaluation metrics (MAE, RMSE) alongside relative‑error measures. Overall, the paper demonstrates a complete end‑to‑end pipeline, from data acquisition to forecasting, but highlights the need for deeper methodological transparency and broader data coverage to enhance the reliability of technology‑trend predictions.

