I have a dataset of text entries with IDs, and a list of match vectors, each consisting of words.
I want to count the number of times words from each match vector occur in each entry.
There are about ten match vectors with 10-100 words in each.
The dataset has about 10^8 entries. The entries range in length from 1 to 600 words, with a median of about 50.
It's a lot of data, basically.
I have a solution that works for a smaller dataset (say, 10^6 entries), but it scales horribly.
Here's an approximate reprex.
library(tidyverse)
library(magrittr)
# match vectors
fruit = c('apple', 'banana', 'cherry')
vegetable = c('artichoke', 'bean', 'carrot')
food_list = list(fruit, vegetable)
# we make up some data to match
dummy = tibble(
  id = 1:10
) %>%
  rowwise() %>%
  mutate(
    entry = paste(
      paste(
        sample(fruit,
               sample(0:5, 1),
               replace = T
        ),
        collapse = ' '),
      paste(
        sample(vegetable,
               sample(0:5, 1),
               replace = T
        ),
        collapse = ' '),
      paste(
        sample(c('filler1', 'filler2', 'filler3'),
               sample(0:5, 1),
               replace = T
        ),
        collapse = ' '),
      sep = ' '
    )
  )
The text entries are rows in a large table. I can map through the rows, check each text entry against each match vector, and count the total number of matches per vector, where the match count of the entry "apple banana banana chair" against the match vector c("apple", "banana", "cherry") is 3. I can store these integers in a list column.
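To pin down that counting rule, this is what a single entry / single vector comparison boils down to (str_count comes from stringr, which tidyverse attaches; note it counts substring matches too, so e.g. 'bean' would also match inside 'beans'):
sum(str_count('apple banana banana chair', c('apple', 'banana', 'cherry')))
# 1 'apple' + 2 'banana' + 0 'cherry' = 3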
# map through data, count overlaps
getMatches = function(dummy){
  # relies on dummy still being rowwise(), so entry is a single string here
  dummy %>%
    mutate(
      counts =
        list(
          map_dbl(food_list,
                  ~ str_count(entry, .) %>%
                    sum()
          )
        )
    )
}
res = getMatches(dummy)
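For inspection, the list column can be spread into one column per match vector; a minimal sketch, assuming tidyr's unnest_wider is available (names_sep just numbers the columns in food_list order):
res %>%
  ungroup() %>%
  unnest_wider(counts, names_sep = '_')
# counts_1 = fruit matches, counts_2 = vegetable matches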
This slows down fast.
# larger sets
dummy10 = dummy %>%
  slice(rep(1:n(), each = 10))
dummy100 = dummy %>%
  slice(rep(1:n(), each = 100))
dummy1000 = dummy %>%
  slice(rep(1:n(), each = 1000))
dummy10000 = dummy %>%
  slice(rep(1:n(), each = 10000))
dummies = list(dummy, dummy10, dummy100, dummy1000, dummy10000)
# times
getTimes = function(dummy){
  tictoc::tic('get time')
  res = getMatches(dummy)
  tictoc::toc()
  res
}
map(dummies, ~ getTimes(.))
# get time: 0.012 sec elapsed
# get time: 0.016 sec elapsed
# get time: 0.104 sec elapsed
# get time: 0.926 sec elapsed
# get time: 8.569 sec elapsed
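Extrapolating linearly from the last run (10^5 rows in about 8.6 seconds), the full 10^8-entry dataset works out to roughly 8.6 × 10^3 seconds, i.e. a couple of hours, and the real entries are longer than these dummies.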
What can I do? I can obviously parallelise this, replace dplyr with data.table, or drop down to awk, but I feel like there are fundamental problems with the approach.
Or maybe not, and matching a lot of text against a lot of text simply takes a very long time?
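For context, the kind of restructuring I have been wondering about is to tokenize everything once and count matches with a join, instead of running regexes per row. A rough sketch of that idea, not something I have benchmarked (the lookup table food_tbl and its vector column are made up here, and unlike str_count it only counts whole tokens, not substrings):
# one row per (match vector, word)
food_tbl = tibble(
  vector = c('fruit', 'vegetable'),
  word = list(fruit, vegetable)
) %>%
  unnest(word)

counts = dummy %>%
  ungroup() %>%
  separate_rows(entry, sep = ' ') %>%             # one row per (id, token)
  inner_join(food_tbl, by = c('entry' = 'word')) %>%
  count(id, vector)                               # matches per id and match vector
# ids with zero matches for a vector are simply absent here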