class: inverse, left, middle background-image: url(resources/images/cover.gif) background-size: cover # Análise de dados de COVID 19: <br> utilizando CP4D e DB2 ## Educathon 2022 <img src="resources/images/logo-ibm.png" width="150px"/> Thiago Pires | Abr 2022 <br><br><br> Repositório [github.com/th1460/ibm-educathon-2022](https://github.com/th1460/ibm-educathon-2022) <br> Slides [th1460.github.io/ibm-educathon-2022](https://th1460.github.io/ibm-educathon-2022) --- layout: true background-image: url(resources/images/logo-ibm.png) background-position: 97% 3% background-size: 7% --- # Thiago Pires .pull-left[ <img class="rounded-corners" src="resources/images/avatar.jpeg" width="400px"/> ] .pull-right[
] --- class: left middle background-color: #00b33c # Criando uma conta na IBM Cloud --- # Criando uma conta na IBM Cloud .pull-left[ - Acesse https://cloud.ibm.com - Clicar em <mark>Create an account</mark> <img src="resources/images/ibm-cloud-login.png" width="90%" /> ] .pull-right[ - Preencha o formulário com o email e informações pessoais - <mark>Não precisa incluir um cartão de crédito</mark> para utilizar os recursos da cloud necessários para este workshop! <img src="resources/images/ibm-cloud-registration.png" width="90%" /> ] --- class: left middle background-color: #ffcc31 # Cloud Pak for Data --- # Cloud Pak for Data >*O Cloud Pak for Data (CP4D) <mark>é uma plataforma de análise que ajuda a preparar dados para inteligência artificial (IA)</mark>. Ele permite que engenheiros de dados, administradores de dados, cientistas de dados e analistas de negócios colaborem usando uma plataforma multicloud integrada.* ## Acessando o recurso na Cloud .pull-left[ <img src="resources/images/cp4d.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <mark>Clicar em Launch</mark> para iniciar o recurso. <img src="resources/images/cp4d-overview.png" width="60%" style="display: block; margin: auto;" /> ] --- # Bem-vindo ao Cloud Pak for Data <img src="resources/images/cp4d-welcome.png" width="50%" style="display: block; margin: auto;" /> ## Criar um projeto .pull-left[ <img src="resources/images/cp4d-project.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ Em seguida: - Definir um none (mandatório) - Colocar uma descrição (opcional) - Associar um Cloud Object Storage (mandatório) ] --- # Visão geral do projeto <img src="resources/images/cp4d-assets.png" width="90%" style="display: block; margin: auto;" /> --- # Criar serviços .pull-left[ - Clicar em <mark>Create a service</mark>. <img src="resources/images/cp4d-service.png" width="90%" style="display: block; margin: auto;" /> - Adicionar <mark>Watson Studio</mark> <img src="resources/images/cp4d-watson-studio.png" width="90%" style="display: block; margin: auto;" /> - Ir para https://dataplatform.cloud.ibm.com/projects/ e clicar no projeto criado anteriormente ] .pull-right[ - Ir para <mark>Launch IDE>RStudio</mark> <img src="resources/images/cp4d-ide.png" width="90%" style="display: block; margin: auto;" /> - <mark>Launch RStudio</mark> com a configuração sugerida <img src="resources/images/cp4d-rstudio.png" width="90%" style="display: block; margin: auto;" /> ] --- # RStudio <img src="resources/images/cp4d-rstudio-start.png" width="100%" style="display: block; margin: auto;" /> --- # Adicionando o DB2 no projeto >*O DB2 é um <mark>banco de dados SQL</mark> em cloud totalmente gerenciado e com escalabilidade sob demanda de computação e armazenamento.* .pull-left[ - Ir na <mark>pesquisa por recursos</mark> <img src="resources/images/cp4d-db2.png" width="90%" style="display: block; margin: auto;" /> - O DB2 estará listado como <mark>ativo</mark> <img src="resources/images/cp4d-db2-active.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ - <mark>Criar credencial</mark> de acesso <img src="resources/images/cp4d-db2-credential.png" width="90%" style="display: block; margin: auto;" /> >*<mark>SQL significa Standard Query Language</mark>, literalmente a linguagem padrão para realizar queries. A linguagem SQL é utilizada de maneira relativamente parecida entre os principais bancos de dados relacionais do mercado* >*<mark>Bancos de dados relacionais</mark> são baseados no modelo relacional, uma maneira intuitiva e direta de representar dados em tabelas.* ] --- # Preparando o ambiente Baixar repositório no RStudio ```bash git clone https://github.com/th1460/ibm-educathon-2022.git ``` Instalar bibliotecas que serão utilizadas no projeto ```r install.packages("renv") renv::restore() ``` --- class: left middle background-color: #d74c3c # Análises --- # Ler dados bruto ```r require(magrittr, include.only = "%>%") require(dbplyr) url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv" covid19 <- read.csv(url, stringsAsFactors = FALSE) covid19 %>% dplyr::select(1:9) %>% dplyr::glimpse() ``` ``` Rows: 284 Columns: 9 $ Province.State <chr> "", "", "", "", "", "", "", "", "", "Australian Capital… $ Country.Region <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola… $ Lat <dbl> 33.93911, 41.15330, 28.03390, 42.50630, -11.20270, -71.… $ Long <dbl> 67.709953, 20.168300, 1.659600, 1.521800, 17.873900, 23… $ X1.22.20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ X1.23.20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ X1.24.20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ X1.25.20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ X1.26.20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 0… ``` --- # Preparar dados .pull-left[ ```r (covid19 <- covid19 %>% dplyr::filter(Country.Region %in% c("Brazil", "Peru", "Bolivia", "Chile", "Argentina", "Colombia", "Venezuela", "Ecuador", "Uruguay", "Paraguay")) %>% dplyr::select(- Province.State) %>% tidyr::pivot_longer(!Country.Region:Long, names_to = "date", values_to = "cumulate") %>% dplyr::group_by(Country.Region) %>% dplyr::mutate(date = gsub("X", "", date) %>% gsub("\\.", "-", .) %>% lubridate::mdy(), value = cumulate - dplyr::lag(cumulate)) %>% dplyr::ungroup() %>% dplyr::rename(country = Country.Region)) ``` ] .pull-right[ ``` # A tibble: 8,260 × 6 country Lat Long date cumulate value <chr> <dbl> <dbl> <date> <int> <int> 1 Argentina -38.4 -63.6 2020-01-22 0 NA 2 Argentina -38.4 -63.6 2020-01-23 0 0 3 Argentina -38.4 -63.6 2020-01-24 0 0 4 Argentina -38.4 -63.6 2020-01-25 0 0 5 Argentina -38.4 -63.6 2020-01-26 0 0 6 Argentina -38.4 -63.6 2020-01-27 0 0 7 Argentina -38.4 -63.6 2020-01-28 0 0 8 Argentina -38.4 -63.6 2020-01-29 0 0 9 Argentina -38.4 -63.6 2020-01-30 0 0 10 Argentina -38.4 -63.6 2020-01-31 0 0 # … with 8,250 more rows ``` ] --- # Salvar dados no DB2 ```r # Ler variáveis de ambiente readRenviron("../.Renviron") # Criar conexão com o banco drv <- RJDBC::JDBC("com.ibm.db2.jcc.DB2Driver", "../jars/db2jcc4.jar") host <- Sys.getenv("DB2_HOST") user <- Sys.getenv("DB2_USER") password <- Sys.getenv("DB2_PASSWORD") uri <- sprintf("jdbc:db2://%s/bludb:user=%s;password=%s;sslConnection=true;", host, user, password) db2 <- DBI::dbConnect(drv, uri) ``` ```r # Escrever tabela no banco de dados DBI::dbWriteTable(db2, "COVID19", value = covid19, overwrite = TRUE) ``` --- # Ler dados do DB2 ```r dplyr::tbl(db2, "COVID19") %>% dplyr::filter(COUNTRY == "Brazil") ``` ``` # Source: lazy query [?? x 6] # Database: JDBCConnection COUNTRY LAT LONG DATE CUMULATE VALUE <chr> <dbl> <dbl> <chr> <dbl> <dbl> 1 Brazil -14.2 -51.9 2020-01-22 0 NA 2 Brazil -14.2 -51.9 2020-01-23 0 0 3 Brazil -14.2 -51.9 2020-01-24 0 0 4 Brazil -14.2 -51.9 2020-01-25 0 0 5 Brazil -14.2 -51.9 2020-01-26 0 0 6 Brazil -14.2 -51.9 2020-01-27 0 0 7 Brazil -14.2 -51.9 2020-01-28 0 0 8 Brazil -14.2 -51.9 2020-01-29 0 0 9 Brazil -14.2 -51.9 2020-01-30 0 0 10 Brazil -14.2 -51.9 2020-01-31 0 0 # … with more rows ``` --- # Análises ```r brazil <- dplyr::tbl(db2, "COVID19") %>% dplyr::filter(COUNTRY == "Brazil") %>% dplyr::as_tibble() %>% dplyr::mutate(DATE = DATE %>% lubridate::ymd()) ``` ## Distribuição do número de casos diários .pull-left[ ```r brazil %>% ggplot2::ggplot() + ggplot2::aes(DATE, VALUE) + ggplot2::geom_line() + ggplot2::theme_minimal() + ggplot2::labs(x = "Data", y = "# de casos de COVID 19") ``` ] .pull-right[ <img src="index_files/figure-html/plot1-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Análises ## Distribuição da média móvel de 7 dias de casos ```r brazil <- brazil %>% * dplyr::mutate(MOVING_AVERAGE = zoo::rollmean(VALUE, 7, align = "right", fill = NA)) ``` .pull-left[ ```r brazil %>% ggplot2::ggplot() + ggplot2::aes(DATE, MOVING_AVERAGE) + ggplot2::geom_line() + ggplot2::theme_minimal() + ggplot2::labs(x = "Data", y = "Média móvel de casos de COVID 19") ``` ] .pull-right[ <img src="index_files/figure-html/plot2-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Análises ## Valor máximo da distribuição de média móvel de 7 dias de casos ```r brazil %>% dplyr::filter(MOVING_AVERAGE == max(MOVING_AVERAGE, na.rm = TRUE)) %>% dplyr::select(COUNTRY, DATE, MOVING_AVERAGE) %>% reactable::reactable() ```
--- # Análises ## Distribuição da média móvel de 7 dias de casos (plotly) .pull-left[ ```r gg <- brazil %>% ggplot2::ggplot() + ggplot2::aes(DATE, MOVING_AVERAGE) + ggplot2::geom_line() + ggplot2::theme_minimal() + ggplot2::labs(x = "Data", y = "Média móvel de casos de COVID 19") gg %>% * plotly::ggplotly() ``` ] .pull-right[
] --- # Análises ## Variação entre a média móvel mais atual e a média móvel anterior ```r delta <- dplyr::tbl(db2, "COVID19") %>% dplyr::filter(DATE %in% c("2022-04-22", "2022-04-23")) %>% dplyr::as_tibble() %>% dplyr::mutate(DATE = DATE %>% lubridate::ymd(), MOVING_AVERAGE = zoo::rollmean(VALUE, 7, align = "right", fill = NA)) %>% dplyr::group_by(COUNTRY) %>% * dplyr::mutate(`DELTA%` = (MOVING_AVERAGE - dplyr::lag(MOVING_AVERAGE))/dplyr::lag(MOVING_AVERAGE) * 100, `DELTA%` = ifelse(is.na(`DELTA%`), 0, round(`DELTA%`, 1))) %>% dplyr::filter(DATE == "2022-04-23") gg <- delta %>% dplyr::select(COUNTRY, `DELTA%`) %>% ggplot2::ggplot() + ggplot2::aes(COUNTRY, `DELTA%`) + ggplot2::geom_col(fill = ifelse(delta$`DELTA%` > 0, "#cb6e5d", "#005c5b")) + ggplot2::theme_minimal() + ggplot2::labs(x = "Países", y = "Delta %") plotly::ggplotly(gg) ``` --- # Análises ## Variação entre a média móvel mais atual e a média móvel anterior
--- # Análises ## Mapa com as variações percentuais .pull-left[ ```r conpal <- leaflet::colorNumeric(palette = "RdBu", domain = delta$`DELTA%`, na.color = "black", reverse = TRUE) delta %>% dplyr::select(COUNTRY, LAT, LONG, `DELTA%`) %>% leaflet::leaflet() %>% leaflet::addProviderTiles("CartoDB.DarkMatter") %>% leaflet::addCircleMarkers(~LONG, ~LAT, label = paste(delta$COUNTRY, " ", round(delta$`DELTA%`, 1)), color = ~conpal(delta$`DELTA%`)) %>% leaflet::addLegend(position = "bottomleft", title = "Delta %", pal = conpal, values = delta$`DELTA%`, opacity = 0.8) ``` ] .pull-right[
] --- # Funções SQL ```r dplyr::tbl(db2, "COVID19") %>% * dplyr::mutate(REF_DATE = dplyr::sql("VARCHAR_FORMAT(LAST_DAY(TO_DATE(DATE, 'YYYYMMDD')), 'YYYY-MM-DD')")) %>% dplyr::group_by(COUNTRY, REF_DATE) %>% dplyr::summarise(SUM = sum(VALUE, na.rm = TRUE), .groups = "drop") %>% dplyr::filter(COUNTRY == "Brazil") ``` ``` # Source: lazy query [?? x 3] # Database: JDBCConnection COUNTRY REF_DATE SUM <chr> <chr> <dbl> 1 Brazil 2020-01-31 0 2 Brazil 2020-02-29 2 3 Brazil 2020-03-31 5715 4 Brazil 2020-04-30 81470 5 Brazil 2020-05-31 428950 6 Brazil 2020-06-30 896532 7 Brazil 2020-07-31 1257782 8 Brazil 2020-08-31 1244378 9 Brazil 2020-09-30 902536 10 Brazil 2020-10-31 720398 # … with more rows ``` --- class: inverse center middle background-color: #000 .pull-left[ # Obrigado! ] .pull-right[ [th1460.github.io](https://th1460.github.io/)<br> [github.com/th1460](https://github.com/)<br> [medium.com/@thopr](https://medium.com/@thopr)<br> [linkedin.com/in/thop](https://www.linkedin.com/in/thop)<br><br> slides [th1460.github.io/ibm-educathon-2022](https://th1460.github.io/ibm-educathon-2022) ]