as.numeric introduces NAs when converting character to numeric in R, even though the vector has no NAs
I'm working in R and have a data frame, dd_2006, with numeric vectors. When I first imported the data, I needed to remove the $, the decimal points, and some blank spaces from three variables: SumOfCost, SumOfCases, and SumOfUnits. To do that, I used str_replace_all. However, once I used str_replace_all, the vectors were converted to character. So I used as.numeric(var) to convert the vectors to numeric, but NAs were introduced, even though there were no NAs in the vectors when I ran the following code before the as.numeric code:
    sum(is.na(dd_2006$SumOfCost))
    [1] 0
    sum(is.na(dd_2006$SumOfCases))
    [1] 0
    sum(is.na(dd_2006$SumOfUnits))
    [1] 0
Here is my code after the import, beginning with removing the $ from the vector. In the str(dd_2006) output, I deleted some of the variables to save space, so the column numbers in the str_replace_all code below do not match the output I've posted here (but they do in the original code):
    library("stringr")
    dd_2006$SumOfCost <- str_sub(dd_2006$SumOfCost, 2, ) # 2 = the first # after the $

    # Removes decimal pt, zeros after, and commas
    dd_2006[, 9] <- str_replace_all(dd_2006[, 9], ".00", "")
    dd_2006[, 9] <- str_replace_all(dd_2006[, 9], ",", "")
    dd_2006[, 10] <- str_replace_all(dd_2006[, 10], ".00", "")
    dd_2006[, 10] <- str_replace_all(dd_2006[, 10], ",", "")
    dd_2006[, 11] <- str_replace_all(dd_2006[, 11], ".00", "")
    dd_2006[, 11] <- str_replace_all(dd_2006[, 11], ",", "")

    str(dd_2006)
    'data.frame': 12604 obs. of 14 variables:
     $ CMHSP      : Factor w/ 46 levels "Allegan","AuSable Valley",..: 1 1 1
     $ FY         : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1 ...
     $ Population : Factor w/ 1 level "DD": 1 1 1 1 1 1 1 1 1 1 ...
     $ SumOfCases : chr "0" "1" "0" "0" ...
     $ SumOfUnits : chr "0" "365" "0" "0" ...
     $ SumOfCost  : chr "0" "96416" "0" "0" ...
I found a reply to a question similar to mine here, which uses the following code:
    # create dummy data.frame
    d <- data.frame(char = letters[1:5],
                    fake_char = as.character(1:5),
                    fac = factor(1:5),
                    char_fac = factor(letters[1:5]),
                    num = 1:5,
                    stringsAsFactors = FALSE)
Let's take a look at the data frame:
    > d
      char fake_char fac char_fac num
    1    a         1   1        a   1
    2    b         2   2        b   2
    3    c         3   3        c   3
    4    d         4   4        d   4
    5    e         5   5        e   5
Let's run:
    > sapply(d, mode)
           char   fake_char         fac    char_fac         num
    "character" "character"   "numeric"   "numeric"   "numeric"
    > sapply(d, class)
           char   fake_char         fac    char_fac         num
    "character" "character"    "factor"    "factor"   "integer"
Now you may ask yourself, "Where's the anomaly?" Well, I've come across quite a few peculiar things in R, and this is not the most confusing of them, but it can trip you up, especially if you read this just before going to bed.
Here goes: the first two columns are character. I deliberately called the second one fake_char. Spot the similarity between this character variable and the one Dirk created in his reply. It is actually a numeric vector converted to character. The third and fourth columns are factors, and the last column is "purely" numeric.
If you use the transform function, you can convert fake_char to numeric, but not the char variable itself.
    > transform(d, char = as.numeric(char))
      char fake_char fac char_fac num
    1   NA         1   1        a   1
    2   NA         2   2        b   2
    3   NA         3   3        c   3
    4   NA         4   4        d   4
    5   NA         5   5        e   5
    Warning message:
    In eval(expr, envir, enclos) : NAs introduced by coercion
But if you do the same thing on fake_char and char_fac, you'll be lucky and get away with no NAs:
    > transform(d, fake_char = as.numeric(fake_char), char_fac = as.numeric(char_fac))
      char fake_char fac char_fac num
    1    a         1   1        1   1
    2    b         2   2        2   2
    3    c         3   3        3   3
    4    d         4   4        4   4
    5    e         5   5        5   5
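The char_fac column in that last output is worth a second look: as.numeric applied to a factor returns the internal level codes, not the labels. A minimal sketch with a made-up factor f (hypothetical, not taken from the data above) shows the difference:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

as.numeric(f)                # level codes: 1 2 1
as.numeric(as.character(f))  # labels read as numbers: 10 20 10
```

So "no NAs" after converting a factor does not by itself mean the values came through intact.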
So I tried the above code in my script, but it still produced NAs (without a warning message about coercion).
    # changing SumOfCases, SumOfCost, and SumOfUnits to numeric
    dd_2006_1 <- transform(dd_2006,
                           SumOfCases = as.numeric(SumOfCases),
                           SumOfUnits = as.numeric(SumOfUnits),
                           SumOfCost = as.numeric(SumOfCost))
    > sum(is.na(dd_2006_1$SumOfCost))
    [1] 12
    > sum(is.na(dd_2006_1$SumOfCases))
    [1] 7
    > sum(is.na(dd_2006_1$SumOfUnits))
    [1] 11
I've also used table(dd_2006$SumOfCases) and so on to look at the observations and check for any characters I might have missed, but there weren't any. Any thoughts on why the NAs are appearing, and how to get rid of them?
Solution
As Anando pointed out, the problem is somewhere in your data, and we can't really help you without a reproducible example. That said, here is a code snippet to help you pin down the records in your data that are causing the problem:
    test = as.character(c(1,2,3,4,'M'))
    v = as.numeric(test) # NAs introduced by coercion
    ix.na = is.na(v)
    which(ix.na) # row index of our problem = 5
    test[ix.na]  # shows the problematic record, "M"
Instead of trying to guess why NAs are being introduced, pull out the records that are causing the problem and address them directly, one by one, until the NAs go away.
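On a real column it can help to compare the values before and after conversion, so that only entries that became NA through coercion are flagged. A minimal sketch, assuming a made-up character vector x that stands in for something like dd_2006$SumOfCost (the values are illustrative, not from the actual data):

```r
# Hypothetical column: two entries cannot be parsed as numbers
x <- c("0", "96416", "", "1 200", "365")

# Suppress the coercion warning; the failures are inspected directly below
num <- suppressWarnings(as.numeric(x))

# Entries that were not NA before conversion but are NA afterwards
bad <- is.na(num) & !is.na(x)

which(bad)      # positions of the offending records: 3 4
unique(x[bad])  # the raw strings that failed: "" and "1 200"
```

Looking at unique(x[bad]) usually makes the cause obvious, for example empty strings or stray spaces.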
Update: it looks like the problem is in your call to str_replace_all. I don't know the stringr library, but I think you can accomplish the same thing with gsub, like this:
    v2 = c("1.00","2.00","3.00")
    gsub("\\.00","",v2)
    [1] "1" "2" "3"
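The escaping matters because an unescaped "." in a regular expression matches any character, so the pattern ".00" also removes things like "200" or "100". The sketch below uses made-up values in the style of the cost column (an assumption, since the real data isn't available) and shows one possible way the NAs could arise without any coercion warning:

```r
# Hypothetical values in the style of the cost column
x <- c("200.00", "1000.00", "96416.00")

gsub(".00", "", x)                # "." is a wildcard: ""  "0"  "96416"
gsub("\\.00", "", x)              # escaped dot: "200" "1000" "96416"
gsub(".00", "", x, fixed = TRUE)  # literal match: "200" "1000" "96416"

# An empty string converts to NA, and R does not warn about it
as.numeric(gsub(".00", "", x))    # NA 0 96416
```

With stringr, wrapping the pattern in fixed(), as in str_replace_all(x, fixed(".00"), ""), should give the same literal matching.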
I'm not entirely sure what removing the ".00" accomplishes, though:
    sum(as.numeric(v2)!=as.numeric(gsub("\\.00","",v2))) # Illustrates that the vectors are equivalent.
    [1] 0
Unless this serves some specific purpose for you, I'd recommend dropping this step from your preprocessing entirely, as it doesn't appear to be necessary and is causing you problems.
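If the step is dropped, the remaining cleanup could look roughly like this. It is only a sketch: the vector raw and its values are hypothetical stand-ins for the imported columns, and it assumes the only unwanted characters are dollar signs, commas, and whitespace:

```r
# Hypothetical raw values as they might look right after import
raw <- c("$96,416.00", "$0.00", " $1,200.00 ")

# Drop dollar signs, commas and whitespace in one pass; keep the decimals
clean <- gsub("[$,[:space:]]", "", raw)

as.numeric(clean)  # 96416 0 1200
```

as.numeric handles the trailing ".00" on its own, so there is no need to strip it first.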