Estimating General Linear Regression Model of Big Data by Using Multiple Test Technique

Authors

  • Ahmed Mahdi Salih, Department of Statistics, College of Administration and Economics, University of Wasit, Iraq. https://orcid.org/0000-0002-6109-224X
  • Munaf Yousif Hmood, Department of Statistics, College of Administration and Economics, University of Baghdad, Iraq.

DOI:

https://doi.org/10.62933/e6740m35

Keywords:

Big Data, Ridge regression, Social Deprivation, Multiple tests

Abstract

Big Data analysis has prompted many researchers to create or develop efficient statistical techniques that can handle large data sets and the problems they bring, such as noise accumulation and multicollinearity. These problems make the general linear regression model difficult to estimate, which motivates new estimation techniques. This work presents an approach to estimating the general linear regression model of Big Data using a multiple test procedure. The data were collected from the Central Statistics Organization of Iraq and are represented by the Social Deprivation Index (SDI). We clarify the concept of the SDI, describe its components, and show how it is calculated. Two methods are used to estimate the model. The first is our proposed method, an adaptation of the One Covariate at a Time Multiple Testing (OCMT) estimation method that uses a ratio of quadratic forms as the multiple test procedure for selecting the variables in the model. The second is the classical ridge regression (RR) method, which adds positive quantities to the diagonal of the X’X matrix to avoid singularity and is commonly applied to large data sets. The two approaches are compared using the mean square error (MSE). We conclude that the proposed estimator, which relies on the multiple test procedure, performs better than RR.
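
The contrast drawn in the abstract, shrinkage through ridge regression versus variable selection through multiple testing, can be illustrated with a minimal Python sketch. It is not the authors' estimator: the selection step uses ordinary one-covariate-at-a-time t-tests with a Bonferroni-style critical value in place of the ratio of quadratic forms, and the function names, the ridge constant `k`, the settings `p` and `delta`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist


def ridge_estimate(X, y, k):
    """Ridge estimator (X'X + kI)^{-1} X'y; k >= 0 is an assumed tuning constant."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(n_features), X.T @ y)


def one_at_a_time_select(X, y, p=0.05, delta=1.0):
    """Select covariates whose simple-regression t-statistic passes a
    multiple-testing threshold (illustrative stand-in for the paper's test)."""
    n_obs, n_features = X.shape
    # Critical value that tightens as the number of candidate covariates grows.
    crit = NormalDist().inv_cdf(1.0 - p / (2.0 * n_features ** delta))
    yc = y - y.mean()
    selected = []
    for j in range(n_features):
        xc = X[:, j] - X[:, j].mean()
        sxx = xc @ xc
        slope = (xc @ yc) / sxx
        resid = yc - slope * xc
        sigma2 = (resid @ resid) / (n_obs - 2)
        t_stat = slope / np.sqrt(sigma2 / sxx)
        if abs(t_stat) > crit:
            selected.append(j)
    return selected


def post_selection_ols(X, y, selected):
    """Least squares on the selected columns; unselected coefficients stay zero."""
    beta = np.zeros(X.shape[1])
    if selected:
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        beta[selected] = coef
    return beta


def mse(y, y_hat):
    """Mean square error between observed and fitted values."""
    return float(np.mean((y - y_hat) ** 2))


# Simulated data (hypothetical): 3 relevant covariates out of 50.
rng = np.random.default_rng(0)
n_obs, n_features = 500, 50
X = rng.standard_normal((n_obs, n_features))
true_beta = np.zeros(n_features)
true_beta[:3] = [2.0, -2.0, 2.0]
y = X @ true_beta + rng.standard_normal(n_obs)

beta_ridge = ridge_estimate(X, y, k=1.0)
selected = one_at_a_time_select(X, y)
beta_select = post_selection_ols(X, y, selected)

print("selected covariates:", selected)
print("ridge MSE:", mse(y, X @ beta_ridge))
print("selection MSE:", mse(y, X @ beta_select))
```

Ridge keeps every covariate and shrinks all coefficients, whereas the selection approach sets most coefficients exactly to zero. Which one attains the lower MSE depends on the data, so the sketch only shows how such a comparison can be set up.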

Published

2025-05-11

Section

Original Articles

How to Cite

Estimating General Linear Regression Model of Big Data by Using Multiple Test Technique. (2025). Iraqi Statisticians Journal, 2(special issue for ICSA2025), 337-343. https://doi.org/10.62933/e6740m35