Looking at data

When we work with a new dataset, we should look at the data first. What is the format? What are the dimensions? What are the variable names? How are the variables stored? Is there missing data? Are there any flaws?

First, we look at the class of the data:

> class(plants)
[1] "data.frame"

It's very common for data to be stored in data frames. In fact, a data frame is what functions like read.csv() and read.table() return when they read data into R.

The class also tells us the data is two-dimensional: it fits neatly into rows and columns. dim() gives the exact size:
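A quick way to check the read.csv() claim without a file on disk. This is only a sketch: csv_text and toy are made-up names with made-up values, not part of the plants data.

```r
# Toy CSV content standing in for a real file (values are invented)
csv_text <- "name,height\nfern,30\nmoss,2"

# read.csv() accepts the text directly through its `text` argument
toy <- read.csv(text = csv_text)

class(toy)          # the class of whatever read.csv() returned
is.data.frame(toy)  # a direct TRUE/FALSE check
```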

> dim(plants)
[1] 5166   10

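So that's 5166 rows and 10 columns (dim() reports rows first, then columns). That's a lot of rows.

If you are curious how much memory the object takes up: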
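If you only want one of the two numbers, nrow() and ncol() give them separately. A small sketch on a made-up data frame (toy is hypothetical, not the plants data):

```r
# Small invented data frame: 3 rows, 2 columns
toy <- data.frame(
  name   = c("fern", "moss", "oak"),
  height = c(30, 2, 2000)
)

dim(toy)     # both numbers: rows first, then columns
nrow(toy)    # rows only
ncol(toy)    # columns only
dim(toy)[1]  # same as nrow(toy)
```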

> object.size(plants)
745944 bytes
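Bytes are hard to read at a glance (745944 bytes is about 728 Kb, dividing by 1024). The result of object.size() can be formatted in friendlier units. A sketch on a made-up data frame (toy and size are hypothetical names):

```r
# Invented data frame just to have something to measure
toy <- data.frame(x = 1:1000, y = rnorm(1000))

size <- object.size(toy)

format(size, units = "Kb")   # size as a string in kilobytes
print(size, units = "auto")  # let R pick a sensible unit
```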

To get the column names:

> names(plants)
 [1] "Scientific_Name"      "Duration"             "Active_Growth_Period" "Foliage_Color"        "pH_Min"               "pH_Max"              
 [7] "Precip_Min"           "Precip_Max"           "Shade_Tolerance"      "Temp_Min_F"          

You can't look at the whole thing at once, so just peek at the first few rows:

> head(plants)
               Scientific_Name          Duration Active_Growth_Period Foliage_Color pH_Min pH_Max Precip_Min Precip_Max Shade_Tolerance Temp_Min_F
1                  Abelmoschus              <NA>                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
2       Abelmoschus esculentus Annual, Perennial                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
3                        Abies              <NA>                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
4               Abies balsamea         Perennial    Spring and Summer         Green      4      6         13         60        Tolerant        -43
5 Abies balsamea var. balsamea         Perennial                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
6                     Abutilon              <NA>                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA

# use
head(plants, 10)
# to get the first 10 rows (the default is 6)

If you want to preview the end of the dataset, use tail():

> tail(plants, 15)  #default is 6 rows
                      Scientific_Name  Duration Active_Growth_Period Foliage_Color pH_Min pH_Max Precip_Min Precip_Max Shade_Tolerance Temp_Min_F
5152                          Zizania      <NA>                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5153                 Zizania aquatica    Annual               Spring         Green    6.4    7.4         30         50      Intolerant         32
5154   Zizania aquatica var. aquatica    Annual                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5155                Zizania palustris    Annual                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5156 Zizania palustris var. palustris    Annual                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5157                      Zizaniopsis      <NA>                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5158             Zizaniopsis miliacea Perennial    Spring and Summer         Green    4.3    9.0         35         70      Intolerant         12
5159                            Zizia      <NA>                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5160                     Zizia aptera Perennial                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5161                      Zizia aurea Perennial                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5162                 Zizia trifoliata Perennial                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5163                          Zostera      <NA>                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5164                   Zostera marina Perennial                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5165                           Zoysia      <NA>                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
5166                  Zoysia japonica Perennial                 <NA>          <NA>     NA     NA         NA         NA            <NA>         NA
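To see how the second argument of head() and tail() behaves, here is a sketch on a made-up data frame (toy is hypothetical). A negative n means "everything except that many rows".

```r
# Invented data frame with 20 rows
toy <- data.frame(id = 1:20, value = (1:20)^2)

head(toy, 3)    # first 3 rows
tail(toy, 3)    # last 3 rows (row names keep their original numbers)
head(toy, -15)  # everything except the last 15 rows
```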

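Now summary() is really useful: it summarizes every column of the dataset at once. Numeric columns get the min, quartiles, median, mean, max, and a count of NAs; character columns only show their length, class, and mode.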

> summary(plants)
 Scientific_Name      Duration         Active_Growth_Period Foliage_Color          pH_Min          pH_Max         Precip_Min      Precip_Max    
 Length:5166        Length:5166        Length:5166          Length:5166        Min.   :3.000   Min.   : 5.100   Min.   : 4.00   Min.   : 16.00  
 Class :character   Class :character   Class :character     Class :character   1st Qu.:4.500   1st Qu.: 7.000   1st Qu.:16.75   1st Qu.: 55.00  
 Mode  :character   Mode  :character   Mode  :character     Mode  :character   Median :5.000   Median : 7.300   Median :28.00   Median : 60.00  
                                                                               Mean   :4.997   Mean   : 7.344   Mean   :25.57   Mean   : 58.73  
                                                                               3rd Qu.:5.500   3rd Qu.: 7.800   3rd Qu.:32.00   3rd Qu.: 60.00  
                                                                               Max.   :7.000   Max.   :10.000   Max.   :60.00   Max.   :200.00  
                                                                               NA's   :4327    NA's   :4327     NA's   :4338    NA's   :4338    
 Shade_Tolerance      Temp_Min_F    
 Length:5166        Min.   :-79.00  
 Class :character   1st Qu.:-38.00  
 Mode  :character   Median :-33.00  
                    Mean   :-22.53  
                    3rd Qu.:-18.00  
                    Max.   : 52.00  
                    NA's   :4328
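For the character columns, summary() only reports length, class, and mode, so it says nothing about their missing values or categories. One possible way to fill that gap, sketched on a made-up data frame (toy and its columns are hypothetical, not the plants data):

```r
# Invented data frame with one character column and one numeric column
toy <- data.frame(
  duration = c("Annual", "Perennial", NA, "Perennial"),
  ph_min   = c(4.5, NA, NA, 6.0)
)

# Count of missing values in every column, whatever its type
colSums(is.na(toy))

# How often each category appears, with NA counted as its own group
table(toy$duration, useNA = "ifany")
```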