COMPARING TOOLS PROVIDED BY PYTHON AND R FOR EXPLORATORY DATA ANALYSIS

Mahathir Rahmany(1), Abdullah Mohd Zin(2), Elankovan A Sundararajan(3)


(1) Centre for Software Technology and Management, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor, 43600
(2) Centre for Software Technology and Management, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor, 43600
(3) Centre for Software Technology and Management, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor, 43600
Corresponding Author

Abstract


Uncovering the insight hidden in data requires comprehensive analysis. Exploratory Data Analysis (EDA) is a practical approach to data analysis that guides the analyst in revealing hidden information in the data. EDA exposes patterns and issues in the data and eventually leads to hypotheses. Besides a grounding in basic statistics, EDA calls for a good tool that simplifies the analysis. Python and R, two popular programming languages in the data science world, both provide methods for carrying out this analysis. This paper shows how to perform EDA using Python and R.


Keywords


exploratory data analysis, EDA, Python, R, statistics
