Table 1
Data volumes of different sky survey projects.
| Sky Survey Projects | Data Volume |
|---|---|
| DPOSS (The Palomar Digital Sky Survey) | 3 TB |
| 2MASS (The Two Micron All-Sky Survey) | 10 TB |
| GBT (Green Bank Telescope) | 20 PB |
| GALEX (The Galaxy Evolution Explorer) | 30 TB |
| SDSS (The Sloan Digital Sky Survey) | 40 TB |
| SkyMapper Southern Sky Survey | 500 TB |
| PanSTARRS (The Panoramic Survey Telescope and Rapid Response System) | ~ 40 PB expected |
| LSST (The Large Synoptic Survey Telescope) | ~ 200 PB expected |
| SKA (The Square Kilometer Array) | ~ 4.6 EB expected |
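The volumes in Table 1 span several orders of magnitude and mix TB, PB, and EB, so a common unit makes them easier to compare. The following is a minimal sketch, assuming decimal (SI) prefixes (1 PB = 1,000 TB, 1 EB = 1,000,000 TB); the names `UNIT_IN_TB`, `SURVEYS`, and `to_terabytes` are illustrative and do not come from the source.

```python
# Data volumes copied from Table 1. Decimal (SI) prefixes are an assumption.
UNIT_IN_TB = {"TB": 1, "PB": 1_000, "EB": 1_000_000}

SURVEYS = {
    "DPOSS": (3, "TB"),
    "2MASS": (10, "TB"),
    "GBT": (20, "PB"),
    "GALEX": (30, "TB"),
    "SDSS": (40, "TB"),
    "SkyMapper": (500, "TB"),
    "PanSTARRS": (40, "PB"),  # expected volume
    "LSST": (200, "PB"),      # expected volume
    "SKA": (4.6, "EB"),       # expected volume
}


def to_terabytes(value, unit):
    """Convert a (value, unit) pair to terabytes."""
    return value * UNIT_IN_TB[unit]


volumes_tb = {name: to_terabytes(v, u) for name, (v, u) in SURVEYS.items()}

# Rank the surveys from smallest to largest volume.
ranked = sorted(volumes_tb, key=volumes_tb.get)
for name in ranked:
    print(f"{name:10s} {volumes_tb[name]:>12,.0f} TB")
```

Under these assumptions, the expected SKA volume is roughly five orders of magnitude larger than that of SDSS.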
Table 2
Approaches applied to the main data mining tasks in astronomy, with their applications.
| Data Mining Task | Applied Approaches | Applications in Astronomy |
|---|---|---|
| Classification | Artificial Neural Networks (ANN); Support Vector Machines (SVM); Learning Vector Quantization (LVQ); Decision Trees; Random Forest; K-Nearest Neighbors; Naïve Bayesian Networks; Radial Basis Function Network; Gaussian Process; Decision Table; ADTree | Known knowns: spectral classification (stars, galaxies, quasars, supernovae); photometric classification (stars and galaxies, stars and quasars, supernovae); morphological classification of galaxies; solar activity |
| Regression | Artificial Neural Networks (ANN); Support Vector Regression (SVR); Decision Trees; Random Forest; K-Nearest Neighbor Regression; Kernel Regression; Principal Component Regression (PCR); Gaussian Process; Least Squares Regression; Partial Least Squares | Known unknowns: photometric redshifts (galaxies, quasars); stellar physical parameter measurement ([Fe/H], Teff, log g) |
| Clustering | Principal Component Analysis (PCA); DBSCAN; K-Means; OPTICS; Cobweb; Self-Organizing Map (SOM); Expectation Maximization; Hierarchical Clustering; AutoClass; Gaussian Mixture Modeling (GMM) | Unknown unknowns: classification; special/rare object detection |
| Outlier Detection or Anomaly Detection | Principal Component Analysis (PCA); K-Means; Expectation Maximization; Hierarchical Clustering; One-Class SVM | Unknown unknowns: special/rare object detection |
| Time-Series Analysis | Artificial Neural Networks (ANN); Support Vector Machines (SVM); Random Forest | Known unknowns: novelty detection; trend prediction |
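To make one entry of Table 2 concrete, the sketch below applies the K-Nearest Neighbors approach listed under Classification to a toy star/galaxy separation problem. It is an illustration only: the two features, their values, and the names `knn_predict` and `TRAIN` are hypothetical, and the data are synthetic rather than drawn from any survey.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Synthetic toy data (hypothetical values, NOT real survey measurements):
# (concentration index, colour index) -> class label
TRAIN = [
    ((0.9, 0.2), "star"),
    ((1.0, 0.3), "star"),
    ((0.8, 0.1), "star"),
    ((0.3, 0.9), "galaxy"),
    ((0.2, 1.0), "galaxy"),
    ((0.4, 0.8), "galaxy"),
]

print(knn_predict(TRAIN, (0.85, 0.25)))  # query close to the "star" group
print(knn_predict(TRAIN, (0.25, 0.95)))  # query close to the "galaxy" group
```

In practice the feature vectors would come from photometric or spectral catalogues, and the features would normally be rescaled before distances are computed, since K-Nearest Neighbors is sensitive to the relative scale of its inputs.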
Table 3
Feature selection/extraction methods.
| Feature Selection/Extraction | Applied Approaches | Applications in Astronomy |
|---|---|---|
| Feature Selection | Best First; Exhaustive Search; Greedy Stepwise; Random Search; Rank Search; Race Search; Genetic Search; Random Forest; ReliefF; Fisher Filtering; other wrapper methods | Reducing dimensionality; choosing effective features |
| Feature Extraction | Principal Component Analysis (PCA); Independent Component Analysis (ICA); Linear Discriminant Analysis (LDA); Latent Semantic Indexing (LSI); Singular Value Decomposition (SVD); Multidimensional Scaling (MDS); Partial Least Squares (PLS); Locally Linear Embedding (LLE); ISOMAP; Factor Analysis; Kernel LDA; Kernel PCA; Kernel Partial Least Squares (KPLS) | Noise reduction/removal; reducing dimensionality |
Table 4
Astrostatistics and astroinformatics organizations.
| Organization | Parent Community or Project | Founded | Chair |
|---|---|---|---|
| International Astrostatistics Association (IAA) | The International Statistical Institute (ISI) | August 2012 | Joseph Hilbe |
| IAU Working Group in Astrostatistics and Astroinformatics | The International Astronomical Union (IAU) | August 2012 | Eric Feigelson |
| AAS Working Group in Astroinformatics and Astrostatistics | The American Astronomical Society (AAS) | June 2012 | Zeljko Ivezic |
| ASA Interest Group in Astrostatistics | The American Statistical Association (ASA) | March 2014 | Jessi Cisewski |
| LSST Informatics and Statistics Science Collaboration | The Large Synoptic Survey Telescope (LSST) | Under construction | Kirk Borne |
| IAA Working Group on Cosmostatistics (renamed the Cosmostatistics Initiative, abbreviated COIN) | The International Astrostatistics Association (IAA) | April 2014 | Rafael de Souza |
