Scraping HTML tables into R data frames using the XML package

153

How can I scrape HTML tables using the XML package?

Take, for example, this Wikipedia page on the Brazilian football team. I would like to read it into R and get the table "list of all matches Brazil have played against FIFA recognised teams" as a data.frame. How can I do this?

— Eduardo Leoni
Source

11

To work out XPath selectors, check out selectorgadget.com/ - it's awesome

— hadley

144

... or a shorter attempt:

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

The table we want is the longest one on the page:

tables[[which.max(n.rows)]]

— Jim G.
Source
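To try the "pick the longest table" idea without touching the network, here is a minimal sketch that runs readHTMLTable() on a small inline HTML string. The HTML snippet and the variable names `html`, `doc` and `longest` are made up for illustration; only the three opponent rows are borrowed from the sample output further down this page.

```r
library(XML)

# Made-up page with two tables; the second one is the longer one.
html <- '<html><body>
<table><tr><th>Key</th><th>Value</th></tr><tr><td>a</td><td>1</td></tr></table>
<table>
<tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
<tr><td>Paraguay</td><td>72</td></tr>
<tr><td>Uruguay</td><td>72</td></tr>
</table>
</body></html>'

# Parse the string as HTML, then read every <table> into a list of data frames.
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Drop tables that could not be read, then keep the one with the most rows.
tables <- Filter(Negate(is.null), tables)
n.rows <- vapply(tables, nrow, integer(1))
longest <- tables[[which.max(n.rows)]]
print(longest)
```

Base R's Filter() is used here instead of rlist::list.clean() only to keep the sketch down to one package; either should work.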

The readHTMLTable help page also shows how to use htmlParse(), getNodeSet(), textConnection() and read.table().

— Dave X
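One way those four functions can fit together, sketched under the assumption that the data sits as whitespace-separated plain text inside a `<pre>` element. The inline HTML is invented for illustration and is not taken from the help page.

```r
library(XML)

# Made-up page: a plain-text table inside a <pre> element.
html <- '<html><body><pre>
Opponent Played Won
Argentina 94 36
Paraguay 72 44
</pre></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Locate the <pre> node with XPath and pull out its text.
pre <- getNodeSet(doc, "//pre")[[1]]
txt <- xmlValue(pre)

# Feed the text to read.table() through a text connection.
con <- textConnection(txt)
df <- read.table(con, header = TRUE, stringsAsFactors = FALSE)
close(con)

print(df)
```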

48

library(RCurl)
library(XML)

# Download page using RCurl
# You may need to set proxy details, etc.,  in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})

# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body 
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]

#In this case, the required table is the only one with class "wikitable sortable"  
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable  <- tables[which(tableclasses=="wikitable sortable")]$table

#Get columns headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))

# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
   tablerow <- thetable$children[[i]]$children
   opponent <- tablerow[[1]]$children[[2]]$children$text$value
   others <- unname(sapply(tablerow[-1], function(x) x$children$text$value)) 
   content <- rbind(content, c(opponent, others))
}

# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)

Edited to add:

Sample output

                     Opponent Played Won Drawn Lost Goals for Goals against  % Won
    1               Argentina     94  36    24   34       148           150  38.3%
    2                Paraguay     72  44    17   11       160            61  61.1%
    3                 Uruguay     72  33    19   20       127            93  45.8%
    ...

— Richie Cotton
Source

7

For anyone else lucky enough to find this post: this script probably won't run unless the user adds their "User-Agent" information, as described in this other helpful post: stackoverflow.com/questions/9056705/...

— Rguy

26

Another option, using XPath.

library(RCurl)
library(XML)

theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)

# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))

# Clean up the results
content[,1] <- gsub("Â ", "", content[,1])
tablehead <- gsub("Â ", "", tablehead)
names(content) <- tablehead

This produces the following result:

> head(content)
   Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina     94  36    24   34       148           150 38.3%
2  Paraguay     72  44    17   11       160            61 61.1%
3   Uruguay     72  33    19   20       127            93 45.8%
4     Chile     64  45    12    7       147            53 70.3%
5      Peru     39  27     9    3        83            27 69.2%
6    Mexico     36  21     6    9        69            34 58.3%

— learnr
Source

Great call on using XPath. Minor point: you can simplify the path argument slightly by changing //*/ to //, e.g. "//table[@class='wikitable sortable']/tr/th"

— Richie Cotton
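A quick way to check that the two path forms select the same nodes is to run both against a small inline document. This is only a sketch: the HTML below is invented, with the class name and the first two rows copied from the answer above.

```r
library(XML)

# Made-up fragment that mimics the structure the XPath expressions expect.
html <- '<html><body><div>
<table class="wikitable sortable">
<tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
<tr><td>Paraguay</td><td>72</td></tr>
</table></div></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Original form and the simplified form from the comment.
long_head  <- xpathSApply(doc, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
short_head <- xpathSApply(doc, "//table[@class='wikitable sortable']/tr/th", xmlValue)
print(identical(long_head, short_head))

# Build the data frame with the simplified form; the column count comes from the header.
cells <- xpathSApply(doc, "//table[@class='wikitable sortable']/tr/td", xmlValue)
content <- as.data.frame(matrix(cells, ncol = length(short_head), byrow = TRUE),
                         stringsAsFactors = FALSE)
names(content) <- short_head
print(content)
```

Taking `ncol` from the number of header cells instead of hard-coding 8 is a small design choice that keeps the sketch working if the table gains or loses a column.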

I get the error "Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice." Is there a way around this so that this method works?

— pssguy

2

options(RCurlOptions = list(useragent = "zzzz")). See also omegahat.org/RCurl/FAQ.html, section "Runtime", for other alternatives and discussion.

— learnr
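For reference, a minimal sketch of both ways to set the user agent with RCurl: once for the whole session, as in the comment above, or per request. The user-agent string and the helper name `fetch_page` are hypothetical; substitute your own script name and contact details.

```r
library(RCurl)

# Hypothetical user-agent string; replace with your own details.
ua <- "my-r-scraper/0.1 (contact: you@example.com)"

# Session-wide default, as suggested in the comment above.
options(RCurlOptions = list(useragent = ua))

# Per-call alternative: curl options can also be passed directly to getURL().
fetch_page <- function(url, user_agent = ua) {
  getURL(url, useragent = user_agent, followlocation = TRUE)
}

# Not run here because it needs network access:
# webpage <- fetch_page("https://en.wikipedia.org/wiki/Brazil_national_football_team")
```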

25

rvest, along with xml2, is another popular package for parsing HTML web pages.

library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file <- read_html(theurl)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)

The syntax is easier to use than the XML package's, and for most web pages the package provides all of the options needed.

— Dave2e
Source
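Picking a table by position (`tables[4]`) depends on the page layout at the time. As a sketch of an alternative, a table can be selected by CSS class instead. The inline HTML is invented for illustration, and `html_element()` assumes rvest 1.0.0 or later (older versions call it `html_node()`).

```r
library(rvest)

# Made-up page; the class name and rows mirror the earlier answers.
html <- '<html><body>
<table class="infobox"><tr><th>Key</th></tr><tr><td>a</td></tr></table>
<table class="wikitable sortable">
<tr><th>Opponent</th><th>Played</th><th>Won</th></tr>
<tr><td>Argentina</td><td>94</td><td>36</td></tr>
<tr><td>Paraguay</td><td>72</td><td>44</td></tr>
</table></body></html>'

page <- read_html(html)

# Select the first table carrying both classes, then convert it to a data frame.
node <- html_element(page, "table.wikitable.sortable")
df <- html_table(node)
print(df)
```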

read_html gives me the error "'file:///User/grieb/Auswertungen/tetyana-snp-2016/data/snp-nexus/15/SNP%20Annotation%20Tool.html' does not exist in current working directory ('/Users/grieb/Auswertungen/tetyana-snp-2016/code')."

— scs
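The wording of that error suggests the string was treated as a local file path rather than as a URL. A possible workaround, offered as an untested guess for that particular setup, is to pass a plain filesystem path without the file:/// prefix or %20 escapes. The sketch below only shows that read_html() accepts an ordinary path; the temporary file and its contents are made up.

```r
library(rvest)

# Write a small made-up HTML file to a temporary location.
path <- tempfile(fileext = ".html")
writeLines(paste0(
  "<html><body><table>",
  "<tr><th>Opponent</th><th>Played</th></tr>",
  "<tr><td>Argentina</td><td>94</td></tr>",
  "</table></body></html>"
), path)

# Read it back using the plain path (no file:/// prefix).
page <- read_html(path)
df <- html_table(html_element(page, "table"))
print(df)

unlink(path)
```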