This is a guide to text mining in the R programming language, designed especially for students and researchers using Japanese-language sources. For R novices, the material below is ideal for a full-day workshop. For a four-hour, half-day workshop, at a brisk pace, I’d recommend sections 1 to 6 and then skipping to section 9. I update this guide regularly, so please contact me with suggestions for improvements at mark.ravina@austin.utexas.edu.
What is R and why use it? R is an immensely powerful but idiosyncratic language. Unlike Python, which is consistent and logical, R is full of strange quirks. That is because R was written by statisticians, who just wanted to solve some problems, rather than by computer scientists or software engineers, who would have wanted the language itself to be clear, consistent, and elegant. As a result, R can be more like a human language than an artificial language. For example, in English, the plural of hat is hats, but the plural of alumnus is alumni, while the plural of alumna is alumnae, and the plural of deer is deer. Is that confusing and arbitrary? Yes. Does that stop you from communicating ideas in English? I didn’t think so.
Why learn a difficult and quirky language? Because R is wonderfully helpful in tackling a broad range of problems in the humanities and social sciences: statistical analysis, text mining, data visualization, spatial analysis, network analysis, and even image analysis. R is supported by a sprawling base of users who write and share packages: bundles of code designed to solve specific problems. In March 2017 alone, users posted 850 new or updated packages on subjects ranging from political science and oceanography to genetics and linguistics. For many complex problems, you can rely on another researcher’s code.
Finally, learning R is made much easier by RStudio, an IDE, or “integrated development environment.” RStudio combines the best of a GUI with an old-school command line interface. Furthermore, if you start typing in RStudio and hit the tab key, it will try to auto-complete your code with a suitable command or variable. It’s like training wheels that appear only when you need them.
For small-scale projects, it can be easier to use a GUI than to code in R. For example, it’s easy to do some basic text mining through websites such as Voyant Tools or the Philologica interface to the Aozora Bunko. Both have “point and click” interfaces that make the basic analysis of single texts extremely easy. The limit of such GUI tools is scale. If you want, for example, to count all the times “社会” appears near “女性” in a single text, it’s much easier to point and click than to write code. But what if you want to do that for dozens of texts over a range of terms, and then aggregate and manipulate the results? With the GUI you will need to redo all your “point and clicks” for each text and term, but once you have working code, you will just need to change a few variables and rerun the code. And what if, months or years later, you want to recheck your results before publishing your research? Could you recreate exactly how you once navigated a GUI? Probably not. With R, just inspect and re-run your code.
In this guide, words in boldface are important specialized terms, especially those whose technical meaning differs from the vernacular meaning. Working code appears in grey-shaded boxes, while output appears in similar boxes preceded by ##. This guide assumes a very basic familiarity with the RStudio IDE and R. There are a number of free, online courses that cover those basics. See, for example, the “Introduction to R” and “Working with the RStudio IDE (Part 1)” at DataCamp.
Let’s start with a tidy, pre-processed text, the famous nineteenth-century journal Meiroku zasshi 明六雑誌. This text was prepared by NINJAL (National Institute for Japanese Language and Linguistics 国立国語研究所) as a sample corpus for modern Japanese 近代語 (in contrast to contemporary Japanese 現代語). (For details, see the NINJAL website.) Critical for our purposes, NINJAL has “tokenized” the Meiroku zasshi, breaking the text into “tokens” or words. For example, the string “洋字を以て國語を書する” is whitespace delimited as “洋字 を 以 て 國語 を 書する”. We’ll defer the fraught and demanding tasks of inputting, cleaning, tokenizing, and “chunking” texts, and start by building on the wonderful work of NINJAL.
We will want to write and save our code, so let’s open a new code window in the top left of RStudio. You can use the menus (File -> New File -> R Script) or shortcut keys: Command+Shift+N. For a list of shortcuts, see https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts. Now you can just copy and paste code into that window. The code below will load the Meiroku zasshi (technically, read it into your local environment). To run a line of code, just put the cursor in that line and hit the Run icon at the top of the source pane, or use the shortcut keys: Ctrl+Enter on Windows and Command+Enter on Mac.
Meiroku.df <- read.table("http://laits.utexas.edu/~mr56267/Japanese_Text_Mining/Data/meiroku_zasshi.txt",
header = TRUE, stringsAsFactors = FALSE)
You just pulled the complete Meiroku zasshi into RStudio! You can see the journal in the top right window of the RStudio environment. You can double-click on the word Meiroku.df to make the data frame appear in the top left window, the source pane.
Some of the syntax may be self-evident. The read.table command tells R to read in a table. The details within the parentheses (known as arguments) tell R that the data has a header: the first line includes the variable names, as in a good spreadsheet.
As you can see, we now have the Meiroku zasshi in a data frame called Meiroku.df. A data frame is similar to a spreadsheet, although a software engineer might cringe at that statement. In a data frame, each row is a case or observation and each column is a variable. We’ll come back to those terms many times. For now, just understand that the Meiroku zasshi articles have been put into a grid, with each article and related metadata in its own row. Metadata is information about the text, such as the author.
We identify columns of a data frame by combining the name of the data frame and the name of the column, joined by the $ mark. The $ mark does NOT refer to money. Rather it tells R that a given column (or vector) is associated with a specific data frame. Thus, Meiroku.df$author means the column author in the data frame Meiroku.df. Run this line of code to see:
Meiroku.df$author
## [1] "西周" "西村茂樹" "加藤弘之" "森有礼" "津田真道"
## [6] "西周" "森有礼" "西村茂樹" "森有礼" "杉亨二"
## [11] "津田真道" "西周" "箕作麟祥" "加藤弘之" "杉亨二"
## [16] "西周" "西周" "津田真道" "西周" "杉亨二"
## [21] "箕作麟祥" "加藤弘之" "津田真道" "西周" "加藤弘之"
## [26] "森有礼" "柴田昌吉" "森有礼" "加藤弘之" "箕作麟祥"
## [31] "杉亨二" "津田真道" "清水卯三郎" "津田真道" "森有礼"
## [36] "箕作秋坪" "杉亨二" "西周" "津田真道" "津田真道"
## [41] "箕作麟祥" "西周" "津田真道" "津田真道" "杉亨二"
## [46] "中村正直" "阪谷素" "津田真道" "森有礼" "中村正直"
## [51] "阪谷素" "西周" "津田真道" "中村正直" "加藤弘之"
## [56] "津田真道" "阪谷素" "西周" "箕作麟祥" "杉亨二"
## [61] "津田真道" "森有礼" "中村正直" "阪谷素" "津田真道"
## [66] "津田真道" "杉亨二" "中村正直" "西周" "神田孝平"
## [71] "津田真道" "西周" "津田真道" "加藤弘之" "杉亨二"
## [76] "阪谷素" "西周" "神田孝平" "西周" "神田孝平"
## [81] "阪谷素" "杉亨二" "津田真道" "森有礼" "阪谷素"
## [86] "阪谷素" "西周" "福沢諭吉" "津田真道" "杉亨二"
## [91] "阪谷素" "西周" "津田真道" "阪谷素" "清水卯三郎"
## [96] "神田孝平" "西周" "神田孝平" "中村正直" "津田真道"
## [101] "杉亨二" "西周" "阪谷素" "津田真道" "福沢諭吉"
## [106] "津田真道" "神田孝平" "森有礼" "阪谷素" "阪谷素"
## [111] "西村茂樹" "西周" "西村茂樹" "柏原孝章" "森有礼"
## [116] "津田真道" "柏原孝章" "中村正直" "加藤弘之" "西村茂樹"
## [121] "柏原孝章" "福沢諭吉" "西周" "阪谷素" "森有礼"
## [126] "中村正直" "西村茂樹" "柏原孝章" "神田孝平" "杉亨二"
## [131] "神田孝平" "津田真道" "中村正直" "阪谷素" "津田真道"
## [136] "阪谷素" "西村茂樹" "西村茂樹" "中村正直" "神田孝平"
## [141] "西周" "阪谷素" "西周" "西村茂樹" "中村正直"
## [146] "西周" "阪谷素" "津田真道" "津田仙" "阪谷素"
## [151] "西村茂樹" "西周" "津田真道" "西村茂樹" "阪谷素"
You’ll note that the output is the list of Meiroku zasshi authors, with multiple appearances if they wrote more than one article. The output is 155 elements long because there are 155 articles. Here’s a quick and intuitive way to get a more compact list:
unique(Meiroku.df$author)
## [1] "西周" "西村茂樹" "加藤弘之" "森有礼" "津田真道"
## [6] "杉亨二" "箕作麟祥" "柴田昌吉" "清水卯三郎" "箕作秋坪"
## [11] "中村正直" "阪谷素" "神田孝平" "福沢諭吉" "柏原孝章"
## [16] "津田仙"
The vector Meiroku.df$author is a one-dimensional data object, so if we want to grab a single element we just need one number. We indicate the element’s location using square brackets. Thus, for the author of the second Meiroku zasshi article:
Meiroku.df$author[2]
## [1] "西村茂樹"
To get a range of values use the c operator, where “c” stands for “combine” or “concatenate.” The range 2 through 5 inclusive is specified with c(2:5), while the values 2 and 5 are written with a comma, c(2,5). You can sometimes skip the c operator, but it never hurts to include it.
Meiroku.df$author[c(2:5)]
## [1] "西村茂樹" "加藤弘之" "森有礼" "津田真道"
Data frames are two-dimensional objects, so identifying an element requires two markers: first the row number(s), then the column number(s). The author information is in the 4th column of the data frame Meiroku.df, so to get the author of the second article:
Meiroku.df[2,4]
## [1] "西村茂樹"
In generic form, the syntax is
Name_of_data_frame[row_number, column_number]
The syntax for ranges works here as well:
Meiroku.df[c(2,6:8,10),c(3:4)]
## title author
## 2 開化の度に因て改文字を発すべきの論 西村茂樹
## 6 非学者職分論 西周
## 7 開化第一話 森有礼
## 8 陳言一則 西村茂樹
## 10 峨国彼得王の遺訓 杉亨二
One of the idiosyncrasies of R is that, within brackets, nothing means everything. If, for example, we want ALL the rows of column 4 (the author column), the syntax is [,4]. For example:
Meiroku.df[,4]
## [1] "西周" "西村茂樹" "加藤弘之" "森有礼" "津田真道"
## [6] "西周" "森有礼" "西村茂樹" "森有礼" "杉亨二"
## [11] "津田真道" "西周" "箕作麟祥" "加藤弘之" "杉亨二"
## [16] "西周" "西周" "津田真道" "西周" "杉亨二"
## [21] "箕作麟祥" "加藤弘之" "津田真道" "西周" "加藤弘之"
## [26] "森有礼" "柴田昌吉" "森有礼" "加藤弘之" "箕作麟祥"
## [31] "杉亨二" "津田真道" "清水卯三郎" "津田真道" "森有礼"
## [36] "箕作秋坪" "杉亨二" "西周" "津田真道" "津田真道"
## [41] "箕作麟祥" "西周" "津田真道" "津田真道" "杉亨二"
## [46] "中村正直" "阪谷素" "津田真道" "森有礼" "中村正直"
## [51] "阪谷素" "西周" "津田真道" "中村正直" "加藤弘之"
## [56] "津田真道" "阪谷素" "西周" "箕作麟祥" "杉亨二"
## [61] "津田真道" "森有礼" "中村正直" "阪谷素" "津田真道"
## [66] "津田真道" "杉亨二" "中村正直" "西周" "神田孝平"
## [71] "津田真道" "西周" "津田真道" "加藤弘之" "杉亨二"
## [76] "阪谷素" "西周" "神田孝平" "西周" "神田孝平"
## [81] "阪谷素" "杉亨二" "津田真道" "森有礼" "阪谷素"
## [86] "阪谷素" "西周" "福沢諭吉" "津田真道" "杉亨二"
## [91] "阪谷素" "西周" "津田真道" "阪谷素" "清水卯三郎"
## [96] "神田孝平" "西周" "神田孝平" "中村正直" "津田真道"
## [101] "杉亨二" "西周" "阪谷素" "津田真道" "福沢諭吉"
## [106] "津田真道" "神田孝平" "森有礼" "阪谷素" "阪谷素"
## [111] "西村茂樹" "西周" "西村茂樹" "柏原孝章" "森有礼"
## [116] "津田真道" "柏原孝章" "中村正直" "加藤弘之" "西村茂樹"
## [121] "柏原孝章" "福沢諭吉" "西周" "阪谷素" "森有礼"
## [126] "中村正直" "西村茂樹" "柏原孝章" "神田孝平" "杉亨二"
## [131] "神田孝平" "津田真道" "中村正直" "阪谷素" "津田真道"
## [136] "阪谷素" "西村茂樹" "西村茂樹" "中村正直" "神田孝平"
## [141] "西周" "阪谷素" "西周" "西村茂樹" "中村正直"
## [146] "西周" "阪谷素" "津田真道" "津田仙" "阪谷素"
## [151] "西村茂樹" "西周" "津田真道" "西村茂樹" "阪谷素"
That strange omission can look like a mistake, but it’s a powerful tool. Finally, you can specify columns by name:
Meiroku.df[c(1:6),"year"]
## [1] 1874 1874 1874 1874 1874 1874
Experiment on your own, choosing different parts of the data frame using different arguments.
Giving a variable a value is formally known in R (and other programming languages) as assignment. The assignment operator in R is a combination of the “less than” sign and the dash: <-. Thus, setting x equal to 3 is:
x <- 3
If you want to give x a series of values, use the c operator:
x <- c(1,3)
To assign an alphanumeric value (or series of values), use quotation marks:
city_names <- c("Tokyo","Moscow","Des Moines")
The objects x and city_names are known as vectors. If you look at the Values section of the Environment tab in the top right pane, you’ll see that city_names is marked as chr (meaning character), while x is marked as num (meaning numeric). Setting aside a detailed discussion of data types, it’s important to remember the following: vectors must be homogeneous. For present purposes, that means a vector must be either all pure numbers or all characters (alphanumeric). If there’s a single letter in a largely numeric vector, R will treat the entire vector as chr and refuse to use it in mathematical operations. A related point is that each column (or row) in a data frame can be treated as a vector. Data frames are basically neatly rectangular bundles of vectors.
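Here’s a minimal illustration of that coercion rule, using throwaway variable names:

mixed_vector <- c(1, 2, "three") ## one character element in a mostly numeric vector
class(mixed_vector) ## returns "character": the numbers were coerced to text

If you then try mixed_vector * 2, R will refuse with a “non-numeric argument” error.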
Let’s build on that basic knowledge of vectors, by asking R about Meiroku zasshi authors. For example, which elements of Meiroku.df$author are equal to Nishi Amane 西周? To ask that question in R we use a doubled equals sign.
Meiroku.df$author=="西周"
## [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## [13] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [73] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [97] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [145] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
There’s an enormous difference between double and single equals signs. A single equals sign is more of a command, telling R to make Meiroku.df$author equal to Nishi Amane. That would actually overwrite all the author values! For that reason, even though = will sometimes work to assign values, it’s safer and clearer to use the assignment operator <-.
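Here’s a minimal illustration of the difference:

x <- 5 ## assignment: x is now 5
x == 5 ## comparison: returns TRUE
x == 6 ## comparison: returns FALSE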
R answers our query with a logical vector: a series of TRUE/FALSE responses. Nishi Amane is the author of the first element, but not the second, etc. This is accurate, but not especially useful. We can, however, use this information to get the index numbers of the elements for which the answer is TRUE.
which(Meiroku.df$author=="西周")
## [1] 1 6 12 16 17 19 24 38 42 52 58 69 72 77 79 87 92 97 102
## [20] 112 123 141 143 146 152
This is a more valuable vector, since we can use it to extract and subset data. Let’s assign those index numbers to the variable special_subset
special_subset <- which(Meiroku.df$author=="西周")
Now we can use that vector to get the titles of all the Meiroku zasshi articles written by Nishi Amane
Meiroku.df$title[special_subset]
## [1] "洋字を以て国語を書するの論" "非学者職分論"
## [3] "駁旧相公議一題" "教門論(一)"
## [5] "煉火石造の説" "教門論(二)"
## [7] "教門論(三)" "教門論(四)"
## [9] "教門論(五)" "教門論(六)"
## [11] "知説(一)" "愛敵論"
## [13] "知説(二)" "情実説"
## [15] "秘密説" "知説(三)"
## [17] "知説(四)" "内地旅行(十一月十六日演説)"
## [19] "知説(五)" "網羅議院の説"
## [21] "国民気風論" "人世三宝説(一)"
## [23] "人世三宝説(二)" "人世三宝説(三)"
## [25] "人世三宝説(四)"
This is a key concept, worthy of review.
Let’s try this method to get data for years.
Meiroku.df$year[special_subset]
## [1] 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874
## [16] 1874 1874 1874 1874 1875 1875 1875 1875 1875 1875
If we want, we can create an entirely new data frame, consisting just of articles written by Nishi Amane
Nishi_articles.df <- Meiroku.df[special_subset,]
Remember that the syntax for a data frame is
Name_of_data_frame[row_number, column_number]
and that nothing after the comma means “everything.” So we just took all of the columns of Meiroku.df but just some of the rows. If you want a denser syntax, you can skip the intermediate step of creating the vector special_subset. Just put the selection criteria right in the brackets
Nishi_articles.df <- Meiroku.df[which(Meiroku.df$author=="西周"),]
Programmers love dense code like that and they esteem “one-liners,” extremely compact, powerful code snippets. But, at least at first, it can be much easier to code in small incremental steps.
Having subset Meiroku.df by author, we can repeat the process, subsetting by year. We’ll use the Nishi_articles.df data frame, and include the selection criteria right in the brackets.
Nishi_articles_1874.df <- Nishi_articles.df[which(Nishi_articles.df$year==1874),]
Can you discern the selection criteria? Take a moment to experiment, selecting different years and different authors, and creating your own subsets of Meiroku.df.
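For example, here’s one such experiment, pulling the 1875 essays of Tsuda Mamichi 津田真道 with the same two-step pattern:

Tsuda_articles.df <- Meiroku.df[which(Meiroku.df$author=="津田真道"),] ## subset by author
Tsuda_articles_1875.df <- Tsuda_articles.df[which(Tsuda_articles.df$year==1875),] ## then by year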
In order to do more sophisticated text mining, we’ll rely on some packages and their functions. Setting aside the technicalities, functions are commands and packages are bundles of related functions. In order to use a package we need to install it once, but load it each time we restart R or otherwise clear the R environment. By way of extended metaphor, install.packages is like putting a book on your bookshelf, or leaving it closed on your desk. By contrast, library takes the book down and opens it. For the stringr package, the install.packages and library commands are:
install.packages("stringr") ## run only once
library(stringr) ## run with each new R session
The stringr package has a series of logically named functions for handling strings, a technical term for alphanumeric text. A good example of such a simple, logical function is str_count. What do you suppose this command does?
str_count(string = Meiroku.df$text, pattern = "女")
## [1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
## [26] 0 0 0 0 0 0 0 1 0 2 6 0 0 0 1 0 0 0 0 0 0 0 1 2 0
## [51] 0 0 2 0 4 0 1 0 0 0 0 8 0 0 0 0 1 1 0 0 0 0 0 0 2
## [76] 0 0 0 0 0 1 0 2 5 1 2 1 0 0 0 32 0 5 0 0 0 0 0 0 0
## [101] 0 1 0 0 4 0 0 2 0 1 0 0 0 0 0 0 0 0 7 0 1 10 0 23 0
## [126] 5 0 0 0 4 0 5 0 2 10 1 0 0 0 0 0 0 0 0 0 0 2 0 0 1
## [151] 0 0 0 0 4
In this example, I have used an explicit form of the function str_count, specifying the argument names. The argument string =, for example, tells the function where to search, and pattern = tells R what to look for. You will often see code that skips these explicit flags. The two commands below, for example, will generate the same results, because R infers the arguments string and pattern by position. Nonetheless, I find explicit code easier to follow, both in teaching and in my own research.
str_count(string = Meiroku.df$text, pattern = "女")
str_count(Meiroku.df$text, "女")
Note that this command counts the character 女 both alone and in longer compounds such as 男女 and 女性. We’ll explore methods for refining that search soon. For now, as an interim method, you can add whitespace and search for " 女 ". That will miss the occasional instances of 女 at the beginning or end of a sentence, or (if there’s punctuation) before a period or comma. We’ll cover a more refined method of searching in the next section.
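Here’s what that interim approach looks like. (Remember that NINJAL’s text is whitespace delimited, so standalone tokens are flanked by spaces.)

str_count(string = Meiroku.df$text, pattern = " 女 ") ## counts 女 only as a standalone token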
Rather than letting the results of str_count hang loose, we can add them to the data frame Meiroku.df, creating a new column called 女. Use the $ operator to put the vector in the data frame.
Meiroku.df$女 <- str_count(string = Meiroku.df$text, pattern = "女")
We can now use the same tricks as before to subset a data frame. Let’s select every essay in the Meiroku zasshi that used the character 女 more than 自由:
Meiroku.df$自由 <- str_count(string = Meiroku.df$text, pattern = "自由")
Meiroku_subset.df <- Meiroku.df[which(Meiroku.df$女>Meiroku.df$自由),]
We can, of course, add additional criteria, such as choosing only works by Mori Arinori that use 女 more than 自由. We can either subset in several steps . . .
Meiroku_subset_step_one.df <- Meiroku.df[which(Meiroku.df$女>Meiroku.df$自由),]
Meiroku_subset_step_two.df <- Meiroku_subset_step_one.df[which(Meiroku_subset_step_one.df$author=="森有礼"),]
Meiroku_subset_step_two.df$title
## [1] "妻妾論(一)" "妻妾論(二)" "妻妾論(三)" "妻妾論(四)" "妻妾論(五)"
Or combine the conditions with the & marker
Meiroku_subset.df <- Meiroku.df[which(Meiroku.df$女>Meiroku.df$自由 & Meiroku.df$author=="森有礼"),]
Meiroku_subset.df$title
## [1] "妻妾論(一)" "妻妾論(二)" "妻妾論(三)" "妻妾論(四)" "妻妾論(五)"
You can also combine conditions with the “or” operator |, typed as the shifted (uppercase) version of the backslash key. If you want the titles of essays written by either Mori Arinori or Katō Hiroyuki:
Meiroku.df$title[which(Meiroku.df$author=="加藤弘之" | Meiroku.df$author=="森有礼")]
## [1] "福沢先生の論に荅ふ"
## [2] "学者職分論の評"
## [3] "開化第一話"
## [4] "民撰議院設立建言書之評"
## [5] "ブルンチュリ氏国法汎論摘訳民選議院不可立の論"
## [6] "米国政教(一)"
## [7] "米国政教(二)"
## [8] "宗教"
## [9] "独立国権義"
## [10] "武官の恭順"
## [11] "妻妾論(一)"
## [12] "妻妾論(二)"
## [13] "米国政教(三)"
## [14] "妻妾論(三)"
## [15] "軽国政府"
## [16] "妻妾論(四)"
## [17] "妻妾論(五)"
## [18] "明六社第一年回役員改選に付演説"
## [19] "夫婦同権の流弊論"
## [20] "〔第三十一号「夫婦同権の流弊論」に対するコメント〕"
Take a moment to experiment with subsetting, creating new variables, and specifying multiple criteria.
We’re now going to shift from straightforward, simple code to some dense, advanced commands. That means, for now, just focusing on a few key arguments and ignoring other parts of the command. R has some wonderful packages for visualizing data. We’ll use ggplot, a great graphing package, and make the graphs interactive with plotly. As with all packages, you’ll need to install them once, but only once.
install.packages("ggplot2") ## run only once
install.packages("plotly") ## run only once
library(ggplot2) ## run with each new R session
library(plotly) ## run with each new R session
## base plot
ggplot(data=Meiroku.df, mapping=aes(x=女,y=自由,label=author)) + geom_point()
That’s a basic scatterplot of 女 and 自由, with 女 on the x axis and 自由 on the y axis. Your platform probably did NOT map the unicode glyphs properly; we’ll fix that later. First, let’s parse the ggplot command. This is an intimidating bit of code, so just focus on the parts you will need to change in order to use it for your own research. Consider this version in pseudo code:
## pseudo code — don't run this!
## use as a template
ggplot(data=NAME_OF_DATA_FRAME, mapping=aes(x=WORD_COUNT, y=ANOTHER_WORD_COUNT)) +
geom_point()
In order to change the terms, just change the parts labeled WORD_COUNT and/or ANOTHER_WORD_COUNT. Note that if you want to visualize a term, you need to first get the word count. If, for example, we want to plot 女 against 男, we need to generate the count for 男
Meiroku.df$男 <- str_count(string = Meiroku.df$text, pattern = "男")
Now let’s reuse the plotting code, replacing 自由 with 男
ggplot(data=Meiroku.df, mapping=aes(x=女,y=男)) + geom_point()
Take a moment to experiment, searching for patterns of co-occurrence. And let’s fix the font problem by telling ggplot which font to use. The addition below should work on most versions of OSX; on Windows, substitute a Japanese-capable font installed on your system, such as "Meiryo". Note that we can just keep adding features to our graph with the + sign.
ggplot(data=Meiroku.df, mapping=aes(x=女,y=男)) +
geom_point() +
theme_grey(base_family="Osaka")
It seems that 女 and 男 co-occur. In the Meiroku zasshi, at least, when authors wrote about men (explicitly marked with 男), they also wrote about women.
We can make this graph interactive with a simple tweak, sending the ggplot output to ggplotly. We’ll also add the argument label=author, so that author names appear on mouseover. Run the code and explore. What can you infer about the essays from this graph?
plot_output <- ggplot(data=Meiroku.df, mapping=aes(x=女,y=男,label=author)) + geom_point()
ggplotly(plot_output)
As you may have noticed, these graphs appear to contain fewer than 155 points because some of the articles have the exact same values. This problem is called overplotting: we can’t see some of the observations because they are underneath other observations with the same values.
In this case, we can fix the problem by recounting the words as percentages of the total characters in each article. That’s sometimes called “normalizing.” The nchar command below is rather intuitive — it counts the number of characters.
Meiroku.df$男 <- str_count(string = Meiroku.df$text, pattern = "男")/nchar(Meiroku.df$text)*100
Meiroku.df$自由 <- str_count(string = Meiroku.df$text, pattern = "自由")/nchar(Meiroku.df$text)*100
Meiroku.df$女 <- str_count(string = Meiroku.df$text, pattern = "女")/nchar(Meiroku.df$text)*100
Let’s run the code again and compare the results. This time we’ll wrap the ggplot command within a ggplotly command for a one-liner. Coders celebrate one-liners as emblems of concision, but they can be hard to read. Especially as a novice, feel free to be verbose; if your code runs, you win.
ggplotly(ggplot(data=Meiroku.df, mapping=aes(x=女,y=男,label=author)) + geom_point())
# this is the same as the three line version below
# plot_output <- ggplot(data=Meiroku.df, mapping=aes(x=女,y=男,label=author)) +
# geom_point()
# ggplotly(plot_output)
Note that any numeric variable can be used for x and y in ggplot, so here’s how the usage of 女 varied over time
ggplotly(ggplot(data=Meiroku.df, mapping=aes(x=issue,y=女,label=author)) + geom_point())
Such basic scatterplots lie at the base of many more complex forms of analysis. Correlation analysis, regression analysis (OLS), factor analysis, principal component analysis (PCA), and cluster analysis can all be understood through such simple visualizations. Consider, for example, factor analysis and PCA. As you can see above, the scatter of points for 男 and 女 lies fairly close to a line. Factor analysis and PCA draw a line through the scatter that minimizes the distance to the points. (Technically, we draw a line that minimizes the sum of the squared distances measured perpendicularly from each data point to the PCA line.) We can then “reduce the dimensionality” of the data by treating the PCA line as a proxy for the original data points. Instead of having data for both 女 and 男 (an x coordinate and a y coordinate), we can approximate the data by its position on the PCA line, a single coordinate.
We can also measure the similarity or difference of two texts by calculating their distance in the scatter plot. And we can measure the difference (in word usage) of two authors by aggregating those distances. The details of those techniques are well beyond this workshop, but you now have a sense of the underlying concepts.
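If you’re curious, here’s a minimal sketch of that distance idea, computing the Euclidean distance between the first two articles in the 女/男 plane. It assumes the word-count columns created above.

sqrt((Meiroku.df$女[1] - Meiroku.df$女[2])^2 +
  (Meiroku.df$男[1] - Meiroku.df$男[2])^2) ## distance between articles 1 and 2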
Thus far we have searched the Meiroku zasshi using literal expressions, finding, for example, instances of 女 by searching for the character 女. But what about compounds? How can we search for compounds beginning (or ending) with 女? Such searches require regular expressions or regex, which are terms specifying the type of glyph or its location, rather than the exact glyph.
Let’s begin with a simple example: we’ll search a few characters before and after a given string. In regex, the “period” character “.” means “any character, including whitespace.”
library(stringr)
str_extract_all(string = "これはペンです", pattern = "は", simplify = TRUE)
## [,1]
## [1,] "は"
str_extract_all(string = "これはペンです", pattern = ".は.", simplify = TRUE)
## [,1]
## [1,] "れはペ"
str_extract_all(string = "これはペンです", pattern = "..は..", simplify = TRUE)
## [,1]
## [1,] "これはペン"
The function str_extract_all, as the name suggests, extracts all the strings matching the pattern argument. (Argument is the technical term for the details passed to a function or command.) More interesting is the role of the period in that pattern argument. Note how the argument pattern = "..は.." gets two characters on either side of "は".
Let’s try looking at something more substantial than “これはペンです”. We’ll use the 1890 Imperial Rescript on Education and search for the characters (or, more precisely, glyphs) around 皇.
rescript <- "朕惟フニ我カ皇祖皇宗國ヲ肇ムルコト宏遠ニ德ヲ樹ツルコト深厚ナリ我カ臣民克ク忠ニ克ク孝ニ億兆心ヲ一ニシテ世世厥ノ美ヲ濟セルハ此レ我カ國體ノ精華ニシテ敎育ノ淵源亦實ニ此ニ存ス爾臣民父母ニ孝ニ兄弟ニ友ニ夫婦相和シ朋友相信シ恭儉己レヲ持シ博愛衆ニ及ホシ學ヲ修メ業ヲ習ヒ以テ智能ヲ啓發シ德器ヲ成就シ進テ公益ヲ廣メ世務ヲ開キ常ニ國憲ヲ重シ國法ニ遵ヒ一旦緩急アレハ義勇公ニ奉シ以テ天壤無窮ノ皇運ヲ扶翼スヘシ是ノ如キハ獨リ朕カ忠良ノ臣民タルノミナラス又以テ爾祖先ノ遺風ヲ顯彰スルニ足ラン斯ノ道ハ實ニ我カ皇祖皇宗ノ遺訓ニシテ子孫臣民ノ俱ニ遵守スヘキ所之ヲ古今ニ通シテ謬ラス之ヲ中外ニ施シテ悖ラス朕爾臣民ト俱ニ拳々服膺シテ咸其德ヲ一ニセンコトヲ庶幾フ"
str_extract_all(string = rescript, pattern = "..皇..", simplify = TRUE)
## [,1] [,2] [,3]
## [1,] "我カ皇祖皇" "窮ノ皇運ヲ" "我カ皇祖皇"
Regex also allows us to search for multiple possible characters. We can search 皇 OR 朕 in two ways. First, square brackets have an implied OR so [皇朕] means the characters 皇 OR 朕. We can also use round brackets with the vertical line as an explicit OR.
str_extract_all(string = rescript, pattern = "..[皇朕]..", simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "我カ皇祖皇" "窮ノ皇運ヲ" "獨リ朕カ忠" "我カ皇祖皇" "ラス朕爾臣"
str_extract_all(string = rescript, pattern = "..(皇|朕)..", simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "我カ皇祖皇" "窮ノ皇運ヲ" "獨リ朕カ忠" "我カ皇祖皇" "ラス朕爾臣"
This is a rudimentary form of KWIC, or “key words in context.” Take a moment to experiment with the command above, changing the kanji and the number of characters. Here’s a simple exercise: what is the context of 公 in the Imperial Rescript on Education?
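One possible answer to that exercise:

str_extract_all(string = rescript, pattern = "..公..", simplify = TRUE)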
Rather than adding periods, you can use a number in “curly brackets” to specify repetition. Let’s look for 4 glyphs on either side of 民:
str_extract_all(string = rescript, pattern = ".{4}民.{4}")
## [[1]]
## [1] "リ我カ臣民克ク忠ニ" "存ス爾臣民父母ニ孝" "忠良ノ臣民タルノミ"
## [4] "テ子孫臣民ノ俱ニ遵" "ス朕爾臣民ト俱ニ拳"
The curly brackets can also be used to specify a range of numbers. The Imperial Rescript on Education begins with 朕, and the search string “..朕..” won’t find that instance because there are no characters before 朕.
str_extract_all(string = rescript, pattern = ".{0,4}朕.{0,4}")
## [[1]]
## [1] "朕惟フニ我" "キハ獨リ朕カ忠良ノ" "テ悖ラス朕爾臣民ト"
That {0,4} notation tells R to search for “between 0 and 4 glyphs, before and after 朕, returning as many as possible”
Another powerful regex symbol is the asterisk, which means “the preceding character, zero or more times.” For example, to get everything between 朕 and 民, we search on "朕.*民".
str_extract_all(string = rescript, pattern = "朕.*民")
## [[1]]
## [1] "朕惟フニ我カ皇祖皇宗國ヲ肇ムルコト宏遠ニ德ヲ樹ツルコト深厚ナリ我カ臣民克ク忠ニ克ク孝ニ億兆心ヲ一ニシテ世世厥ノ美ヲ濟セルハ此レ我カ國體ノ精華ニシテ敎育ノ淵源亦實ニ此ニ存ス爾臣民父母ニ孝ニ兄弟ニ友ニ夫婦相和シ朋友相信シ恭儉己レヲ持シ博愛衆ニ及ホシ學ヲ修メ業ヲ習ヒ以テ智能ヲ啓發シ德器ヲ成就シ進テ公益ヲ廣メ世務ヲ開キ常ニ國憲ヲ重シ國法ニ遵ヒ一旦緩急アレハ義勇公ニ奉シ以テ天壤無窮ノ皇運ヲ扶翼スヘシ是ノ如キハ獨リ朕カ忠良ノ臣民タルノミナラス又以テ爾祖先ノ遺風ヲ顯彰スルニ足ラン斯ノ道ハ實ニ我カ皇祖皇宗ノ遺訓ニシテ子孫臣民ノ俱ニ遵守スヘキ所之ヲ古今ニ通シテ謬ラス之ヲ中外ニ施シテ悖ラス朕爾臣民"
The asterisk prompts a “greedy” search, stopping at the last instance of 民. We can specify a “non-greedy” or “lazy” search by adding a question mark.
str_extract_all(string = rescript, pattern = "朕.*?民")
## [[1]]
## [1] "朕惟フニ我カ皇祖皇宗國ヲ肇ムルコト宏遠ニ德ヲ樹ツルコト深厚ナリ我カ臣民"
## [2] "朕カ忠良ノ臣民"
## [3] "朕爾臣民"
There are many guides to regex, for example https://www.rexegg.com/regex-quickstart.html, and I’ve included some key terms in the Appendix below. In this short introduction, let’s turn to Japan-specific regex terms. We can specify hiragana, katakana, or kanji. Do the following search strings make sense?
str_extract_all(string = rescript, pattern = "\\p{Han}{2}", simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "朕惟" "皇祖" "皇宗" "宏遠" "深厚" "臣民" "億兆" "世世" "國體" "精華"
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] "敎育" "淵源" "亦實" "爾臣" "民父" "兄弟" "夫婦" "相和" "朋友" "相信"
## [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
## [1,] "恭儉" "博愛" "智能" "啓發" "德器" "成就" "公益" "世務" "國憲" "國法"
## [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] [,39] [,40]
## [1,] "一旦" "緩急" "義勇" "天壤" "無窮" "皇運" "扶翼" "忠良" "臣民" "又以"
## [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## [1,] "爾祖" "遺風" "顯彰" "皇祖" "皇宗" "遺訓" "子孫" "臣民" "遵守" "所之"
## [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58]
## [1,] "古今" "中外" "朕爾" "臣民" "拳々" "服膺" "咸其" "庶幾"
str_extract_all(string = rescript, pattern = "\\p{Han}{3}", simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] "皇祖皇" "臣民克" "億兆心" "世世厥" "淵源亦" "爾臣民" "夫婦相" "朋友相"
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] "恭儉己" "博愛衆" "一旦緩" "義勇公" "天壤無" "爾祖先" "皇祖皇" "子孫臣"
## [,17] [,18] [,19]
## [1,] "朕爾臣" "拳々服" "咸其德"
str_extract_all(string = rescript, pattern = "\\p{Katakana}{2}", simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "フニ" "ムル" "コト" "ツル" "コト" "ナリ" "ニシ" "セル" "ニシ" "レヲ"
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] "ホシ" "アレ" "スヘ" "キハ" "タル" "ノミ" "ナラ" "スル" "ラン" "ニシ"
## [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29]
## [1,] "スヘ" "シテ" "ラス" "シテ" "ラス" "シテ" "ニセ" "ンコ" "トヲ"
str_extract_all(string = rescript, pattern = "民\\p{Katakana}.", simplify = TRUE)
## [,1] [,2] [,3]
## [1,] "民タル" "民ノ俱" "民ト俱"
Let’s use these regex terms to search the Meiroku zasshi. How was the character 女 used? We’ll “extract” all the instances of “\\p{Han}{0,1}女\\p{Han}{0,1}” from the 150th article in Meiroku zasshi with the command below:
str_extract_all(string = Meiroku.df$text[150], pattern = "\\p{Han}{0,1}女\\p{Han}{0,1}")
## [[1]]
## [1] "婦女"
The result, as you can see, is the single compound 婦女, or “woman.” Let’s run the same search on a bigger chunk of Meiroku zasshi, articles 120 to 130.
str_extract_all(string = Meiroku.df$text[120:130], pattern = "\\p{Han}{0,1}女\\p{Han}{0,1}")
## [[1]]
## character(0)
##
## [[2]]
## [1] "女王"
##
## [[3]]
## [1] "男女" "男女" "男女" "男女" "男女" "男女" "女" "女" "女" "女"
##
## [[4]]
## character(0)
##
## [[5]]
## [1] "女" "男女" "女" "女王" "女" "女" "女權" "女" "女" "女"
## [11] "女" "女伴" "男女" "女" "女帝" "女王" "女下" "女色" "婢女" "男女"
## [21] "女" "女" "女主"
##
## [[6]]
## character(0)
##
## [[7]]
## [1] "女子" "女子" "男女" "男女" "男女"
##
## [[8]]
## character(0)
##
## [[9]]
## character(0)
##
## [[10]]
## character(0)
##
## [[11]]
## [1] "婦女" "男女" "女子" "男女"
What is that thing? What are all those brackets? Let’s review the two data structures we’ve discussed so far: data frames and vectors. Vectors are rows (or columns) of data, and data frames are neat, rectangular combinations of vectors. But str_extract_all doesn’t return a data frame, because the result isn’t neat and rectangular. Some essays in the Meiroku zasshi don’t use 女 at all, so those rows are empty, but essay 91 (by 阪谷素) uses 女 a total of 32 times, as you can confirm with str_count(Meiroku.df$text[91], "女"). R cannot make a data frame when the vectors are uneven, so it returns a list. The double brackets above mark the essay number, and each following vector contains the 女 usage for that essay.
Because they can be uneven, lists are wonderfully flexible, but we usually need to beat lists into something manageable before analyzing the data. In this case, we’ll simply smash all the data together into one vector with unlist.
unlist(str_extract_all(string = Meiroku.df$text, pattern = "\\p{Han}{0,1}女\\p{Han}{0,1}"))
## [1] "兒女" "女子" "女王" "婦女" "遊女" "買女" "子女" "子女" "女學" "女子"
## [11] "女" "女學" "婦女" "女主" "女子" "女子" "婦女" "婦女" "子女" "男女"
## [21] "子女" "子女" "女子" "女子" "女子" "女子" "女子" "女" "女子" "女貞"
## [31] "女子" "女" "女王" "女" "男女" "男女" "男女" "童女" "女子" "女子"
## [41] "女子" "女" "女子" "婦女" "女王" "男女" "女媧" "女飾" "女" "男女"
## [51] "女" "女" "女" "女" "女" "女弊" "女" "男女" "男女" "女"
## [61] "女" "女" "男女" "女" "女" "女飾" "女容" "男女" "女徳" "女頭"
## [71] "女氣" "女" "女" "女" "女頭" "女頭" "女" "女" "女" "女"
## [81] "婦女" "男女" "男女" "女性" "紅女" "女" "女子" "女子" "女" "女"
## [91] "女" "女" "女" "男女" "子女" "子女" "子女" "子女" "男女" "女王"
## [101] "男女" "男女" "男女" "男女" "男女" "男女" "女" "女" "女" "女"
## [111] "女" "男女" "女" "女王" "女" "女" "女權" "女" "女" "女"
## [121] "女" "女伴" "男女" "女" "女帝" "女王" "女下" "女色" "婢女" "男女"
## [131] "女" "女" "女主" "女子" "女子" "男女" "男女" "男女" "婦女" "男女"
## [141] "女子" "男女" "男女" "男女" "男女" "男女" "男女" "女" "女" "男女"
## [151] "男女" "男女" "男女" "女子" "女子" "女" "女性" "男女" "女子" "女郎"
## [161] "婦女" "女" "婦女" "兒女" "女郎" "女" "女郎"
Now we’ve lost the data for which term goes with which essay, but in this simpler form we can easily get a table of word frequency. We’ll use the table command, and then make the table into a data frame. Note that the search below uses a simpler pattern, ".女", which grabs 女 together with the single preceding glyph (including whitespace).
onna_instances <- unlist(str_extract_all(string = Meiroku.df$text, pattern = ".女"))
onna.df <- data.frame(table(onna_instances))
You can see which compounds are the most common by opening onna.df in the source pane and clicking on the “up-down” buttons in the header.
If you’d like to order the data frame within R, use the order command. Notice that the syntax is almost the same as which except that instead of subsetting the data frame, we are changing the order.
onna.df[order(-onna.df$Freq),]
## onna_instances Freq
## 1 女 104
## 6 男女 38
## 4 婦女 9
## 5 子女 9
## 2 兒女 2
## 3 婢女 1
## 7 童女 1
## 8 紅女 1
## 9 買女 1
## 10 遊女 1
Does this data frame help explain the co-occurrence of 女 and 男?
We now have some fairly powerful tools, but these methods are somewhat labor-intensive. There are over 15,000 unique words in the Meiroku zasshi and it would be cumbersome to write 15,000 lines of code by hand, one for each term.
Fortunately, R loves to help with repetitive tasks so we can write 7 or 8 lines of code instead of 15,000. Unfortunately, some of that code is rather advanced, so, for now at least, you’ll just have to use the commands without fully understanding the details. Much of the complexity below involves turning lists and matrices into data frames.
For now we’ll need a list of all the unique terms in the Meiroku zasshi. To get that, we’ll need to smash all the individual articles together into one long string. We’ll use the command paste, the text equivalent of addition.
1 + 2 + 3 ## addition works for numbers
## [1] 6
"a" + "b" ## but not for letters
## Error in "a" + "b": non-numeric argument to binary operator
paste("a", "b", "c") ## the text equivalent of addition
## [1] "a b c"
Because the individual articles are elements of a vector, we need to add the collapse argument. Note how this “collapses” all the articles into one long string
Complete_meiroku <- paste(Meiroku.df$text, collapse = " ")
Now we can split the string into individual words, separating on whitespace. The command str_split is appropriately named: it splits strings.
Complete_meiroku_split <- str_split(string = Complete_meiroku, pattern = " ")
Complete_meiroku_split <- unlist(Complete_meiroku_split)
The object Complete_meiroku_split is now a vector with 173,197 elements, the total word count for all 155 articles. Let’s use table to quickly and easily calculate the frequency of every word in the Meiroku zasshi.
Meiroku_frequency <- data.frame(table(Complete_meiroku_split))
Glance at Meiroku_frequency. Which terms are common and which are rare?
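One quick way to check is to sort the frequency table, using the order command we saw earlier:

Meiroku_frequency[order(-Meiroku_frequency$Freq),][1:10,] ## the ten most frequent tokens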
To get a list of unique words
Meiroku_unique_words <- unique(Complete_meiroku_split)
Note that Meiroku_unique_words is much smaller: only 15,603 elements.
To create a document term matrix we need to search every article in the Meiroku zasshi for each of these terms. And we need to tell R to distinguish between 女 by itself and 女 in a compound. We’ll use the regex string \b, meaning “word boundary.” A word boundary falls between a word character and a non-word character (such as a space or punctuation mark), and also at the start and end of a string. So “\b女” will capture both " 女" and 女 at the start of a text string. We’ll add \b before and after every unique term in Meiroku_unique_words:
Meiroku_search_terms <- paste("\\b",Meiroku_unique_words,"\\b",sep="")
Now we could go through this vector one element at a time, searching for each word, but R is quite happy to run down the entire vector for us. The handy command is sapply. This function may take a minute or two, since R needs to look for 15,603 unique words in 155 articles. The command sapply takes two main arguments. First, we specify the vector of search terms, Meiroku_search_terms. Second, we specify the function that R should apply to all those terms, str_count(string = Meiroku.df$text, pattern = x). We’ll call the result dtm.matrix, then convert it to dtm.df, a document term matrix in data frame form.
dtm.matrix <- sapply(X = Meiroku_search_terms,
FUN = function(x) str_count(string = Meiroku.df$text, pattern = x))
dtm.df <- as.data.frame(dtm.matrix)
If you glance at dtm.df you’ll see that it is 155 rows long (one for each article) and 15,603 columns wide, one for each unique word. Hence the name document term matrix.
I can’t even pretend that sapply is intuitive. Unlike str_count, there is no way to guess from the name what it does. But, as you can see, it’s pretty convenient. Rather than having us go through the vector Meiroku_search_terms word by word (or element by element), sapply takes Meiroku_search_terms as “X” and then “applies” the function str_count(Meiroku.df$text, x) to every element of X.
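If sapply still feels opaque, here’s a toy example with throwaway values:

sapply(X = c(1, 2, 3), FUN = function(x) x * 10) ## returns 10 20 30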
While we’re using advanced apply family functions, let’s “normalize” the word counts, dividing by the total number of words in each document. We did this before for 女 and 男, but we don’t want to rewrite code for every unique term in the Meiroku zasshi. To automate the job, we need to get the totals for each of the 155 articles (each row) and then divide each of the 15,603 columns by that total. That’s roughly 2.4 million calculations, but the code is just one line. The key things to note are the argument MARGIN = 1, which tells apply to work row by row, and the FUN argument, which divides each element by the row total and converts it to a percentage:
dtm_norm.matrix <- apply(X = dtm.matrix, MARGIN = 1, FUN = function(x) x/sum(x)*100)
dtm_norm.df <- as.data.frame(t(dtm_norm.matrix))
If you look at dtm_norm.matrix you’ll see that it’s switched, with documents as columns rather than as rows. We can switch it back with t for “transpose.”
In order to make the most of this document term matrix let’s join it to the metadata in the original Meiroku.df. Both data frames have the same number of rows, so R can bind the columns together with cbind for “column bind”. This just pushes the columns of the two data frames together.
Meiroku_dtm.df <- cbind(Meiroku.df[,c(1:5)], dtm_norm.df)
With this document term matrix, you can now explore the Meiroku zasshi based on the frequency of any word.
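For example, here’s one way to find the five articles that use 自由 most intensively. Because the dtm columns appear in the same order as Meiroku_unique_words (after the five metadata columns), we can locate a word by position rather than by its ugly column name:

jiyuu_column <- which(Meiroku_unique_words == "自由") ## position of 自由 among the unique words
Meiroku_dtm.df$title[order(-Meiroku_dtm.df[, 5 + jiyuu_column])[1:5]] ## the five heaviest users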
Since it took a bit of effort and processing time to create this dtm (document term matrix), let’s save it for later use. The details of the file argument will depend on your OS (Mac or Windows): the placeholder path below uses forward slashes, which work on both, but Windows also accepts double backslashes. Use the suffix .txt. The argument col.names = TRUE preserves the column names, while sep = "\t" separates the columns with tabs.
write.table(Meiroku_dtm.df, file = "FOLDER_ON_YOUR_DRIVE/Meiroku_dtm_df.txt", col.names = TRUE, sep = "\t")
Let’s do some final manipulation of the document term matrix, aggregating by author. Which authors favored which words? First, let’s see how many authors wrote for the Meiroku zasshi
unique(Meiroku.df$author)
## [1] "西周" "西村茂樹" "加藤弘之" "森有礼" "津田真道"
## [6] "杉亨二" "箕作麟祥" "柴田昌吉" "清水卯三郎" "箕作秋坪"
## [11] "中村正直" "阪谷素" "神田孝平" "福沢諭吉" "柏原孝章"
## [16] "津田仙"
Now let’s aggregate the word frequencies by author. We’ll get the total word counts for each author, and then “re-normalize” the dtm. One catch is that the names of the authors are non-numeric, so we’ll need to tell R not to do math on the author names, or any other metadata.
The first five columns are metadata, so we want R to start with the sixth column and go to the last column: c(6:ncol(Meiroku_dtm.df)). To aggregate on the names of the authors, we’ll tell R to look at the author column: Meiroku_dtm.df[,4]. The argument FUN = sum tells R to get sums (rather than, for example, means) by author. Finally, the command to aggregate is logically named aggregate, and the syntax is fairly logical:
Meiroku_author_dtm.df <- aggregate(Meiroku_dtm.df[,c(6:ncol(Meiroku_dtm.df))],
by = list(Meiroku_dtm.df[,4]),
FUN = sum)
If we want to “normalize” the dtm, we can reuse the apply code above to turn the raw counts into percentages, “applying” the formula x/sum(x)*100 to each element of each row.
Meiroku_author.dtm <- apply(X = Meiroku_author_dtm.df[,c(2:ncol(Meiroku_author_dtm.df))], MARGIN = 1, FUN = function(x) x/sum(x)*100)
Meiroku_author_dtm_temp.df <- as.data.frame(t(Meiroku_author.dtm))
## since this matrix is purely numeric, we need to add the author names back in
Meiroku_author_dtm.df <- cbind(Meiroku_author_dtm.df[1],Meiroku_author_dtm_temp.df)
Finally, let’s tidy up the column names, giving the first column the name “authors” and replacing the rest with the clean terms in Meiroku_unique_words, getting rid of those necessary but ugly \\b marks.
colnames(Meiroku_author_dtm.df) <- c("authors",Meiroku_unique_words)
Which authors liked which words? Let’s graph some ideas:
ggplotly(
ggplot(Meiroku_author_dtm.df,
aes(女,男, label=authors)) +
geom_point()
)
Fukuzawa seems to be in a class by himself. How might we cluster the other authors?
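One common approach, sketched here with the base R functions dist and hclust rather than developed fully, is hierarchical clustering on the author-level word frequencies:

author_distances <- dist(Meiroku_author_dtm.df[, -1]) ## pairwise distances between authors
author_clusters <- hclust(author_distances) ## hierarchical clustering
plot(author_clusters, labels = Meiroku_author_dtm.df$authors) ## dendrogram with author names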
Japanese is one of a handful of major languages undelimited by whitespace: there are no spaces between words or “tokens.” Chinese and Tibetan are two other prominent examples. You can do extensive analysis without tokenization, but it’s often useful to have the text broken into words.
I have a complete guide to tokenizing at http://laits.utexas.edu/~mr56267/Japanese_Text_Mining/MeCab_RMeCab.html, but it’s worth exploring the web-based tools offered by NINJAL at http://chamame.ninjal.ac.jp/
Here’s a basic workflow for using the online tools.
We’ll be working with Tanizaki’s Naomi (痴人の愛), available in two formats from the Aozora bunko at https://www.aozora.gr.jp/cards/001383/card58093.html
Here are two equally effective means of reading in the text. To read directly from the website
library(readr)
chijin_xml <- read_lines("https://www.aozora.gr.jp/cards/001383/files/58093_62049.html",
locale = locale(encoding = "SHIFT-JIS"))
The locale argument will take care of the SHIFT-JIS encoding issue. The XHTML file is full of formatting tags, but we can clean those out easily with some regex. Since almost all HTML and XML tags sit between “less than” and “greater than” marks, we can just search for those marks.
chijin_xml <- str_replace_all(chijin_xml, "<.*?>", "")
Now we need to remove the bit of non-text data at the head and foot of the text. We’ll create a new object that retains just the novel.
chijin_xml[1:30] ## we can start with line 18
## [1] ""
## [2] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\""
## [3] " \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">"
## [4] ""
## [5] ""
## [6] "\t"
## [7] "\t"
## [8] "\t"
## [9] "\t谷崎潤一郎 痴人の愛"
## [10] "\t"
## [11] " "
## [12] "\t"
## [13] "\t"
## [14] "\t"
## [15] ""
## [16] ""
## [17] ""
## [18] "痴人の愛"
## [19] "谷崎潤一郎"
## [20] ""
## [21] ""
## [22] ""
## [23] ""
## [24] "一"
## [25] ""
## [26] "私はこれから、あまり世間に類例がないだろうと思われる私達夫婦の間柄に就いて、出来るだけ正直に、ざっくばらんに、有りのままの事実を書いて見ようと思います。それは私自身に取って忘れがたない貴い記録であると同時に、恐らくは読者諸君に取っても、きっと何かの参考資料となるに違いない。殊(こと)にこの頃のように日本もだんだん国際的に顔が広くなって来て、内地人と外国人とが盛んに交際する、いろんな主義やら思想やらが這入(はい)って来る、男は勿論(もちろん)女もどしどしハイカラになる、と云(い)うような時勢になって来ると、今まではあまり類例のなかった私たちの如(ごと)き夫婦関係も、追い追い諸方に生じるだろうと思われますから。"
## [27] "考えて見ると、私たち夫婦は既にその成り立ちから変っていました。私が始めて現在の私の妻に会ったのは、ちょうど足かけ八年前のことになります。尤(もっと)も何月の何日だったか、委(くわ)しいことは覚えていませんが、とにかくその時分、彼女は浅草の雷門の近くにあるカフエエ・ダイヤモンドと云う店の、給仕女をしていたのです。彼女の歳(とし)はやっと数え歳の十五でした。だから私が知った時はまだそのカフエエへ奉公に来たばかりの、ほんの新米だったので、一人前の女給ではなく、それの見習い、―――まあ云って見れば、ウエイトレスの卵に過ぎなかったのです。"
## [28] "そんな子供をもうその時は二十八にもなっていた私が何で眼をつけたかと云うと、それは自分でもハッキリとは分りませんが、多分最初は、その児(こ)の名前が気に入ったからなのでしょう。彼女はみんなから「直ちゃん」と呼ばれていましたけれど、或(あ)るとき私が聞いて見ると、本名は奈緒美(なおみ)と云うのでした。この「奈緒美」という名前が、大変私の好奇心に投じました。「奈緒美」は素敵だ、NAOMI と書くとまるで西洋人のようだ、と、そう思ったのが始まりで、それから次第に彼女に注意し出したのです。不思議なもので名前がハイカラだとなると、顔だちなども何処(どこ)か西洋人臭く、そうして大そう悧巧(りこう)そうに見え、「こんな所の女給にして置くのは惜しいもんだ」と考えるようになったのです。"
## [29] "実際ナオミの顔だちは、(断って置きますが、私はこれから彼女の名前を片仮名で書くことにします。どうもそうしないと感じが出ないのです)活動女優のメリー・ピクフォードに似たところがあって、確かに西洋人じみていました。これは決して私のひいき眼ではありません。私の妻となっている現在でも多くの人がそう云うのですから、事実に違いないのです。そして顔だちばかりでなく、彼女を素っ裸にして見ると、その体つきが一層西洋人臭いのですが、それは勿論後になってから分ったことで、その時分には私もそこまでは知りませんでした。ただおぼろげに、きっとああ云うスタイルなら手足の恰好(かっこう)も悪くはなかろうと、着物の着こなし工合から想像していただけでした。"
## [30] "一体十五六の少女の気持と云うものは、肉親の親か姉妹ででもなければ、なかなか分りにくいものです。だからカフエエにいた頃のナオミの性質がどんなだったかと云われると、どうも私には明瞭(めいりょう)な答えが出来ません。恐らくナオミ自身にしたって、あの頃はただ何事も夢中で過したと云うだけでしょう。が、ハタから見た感じを云えば、孰方(どっち)かと云うと、陰鬱(いんうつ)な、無口な児のように思えました。顔色なども少し青みを帯びていて、譬(たと)えばこう、無色透明な板ガラスを何枚も重ねたような、深く沈んだ色合をしていて、健康そうではありませんでした。これは一つにはまだ奉公に来たてだったので、外の女給のようにお白粉(しろい)もつけず、お客や朋輩(ほうばい)にも馴染(なじみ)がうすく、隅の方に小さくなって黙ってチョコチョコ働いていたものだから、そんな風に見えたのでしょう。そして彼女が悧巧そうに感ぜられたのも、やっぱりそのせいだったかも知れません。"
chijin_xml[2450:2504]
## [1] "さて、話はこれから三四年の後のことになります。"
## [2] "私たちは、あれから横浜へ引き移って、かねてナオミの見つけて置いた山手の洋館を借りましたけれども、だんだん贅沢が身に沁(し)みるに従い、やがてその家も手狭だと云うので、間もなく本牧(ほんもく)の、前に瑞西(スイス)人の家族が住んでいた家を、家具ぐるみ買って、そこへ這入るようになりました。あの大地震で山手の方は残らず焼けてしまいましたが、本牧は助かった所が多く、私の家も壁に亀裂(きれつ)が出来たぐらいで、殆どこれと云う損害もなしに済んだのは、全く何が仕合わせになるか分りません。ですから私たちは、今でもずっとこの家に住んでいる訳なのです。"
## [3] "私はその後、計画通り大井町の会社の方は辞職をし、田舎の財産は整理してしまって、学校時代の二三の同窓と、電気機械の製作販売を目的とする合資会社を始めました。この会社は、私が一番の出資者である代りに、実際の仕事は友達がやってくれているので、毎日事務所へ出る必要はないのですが、どう云う訳か、私が一日家にいるのをナオミが好まないものですから、イヤイヤながら日に一遍は見廻ることにしてあります。私は朝の十一時頃に、横浜から東京に行き、京橋の事務所へ一二時間顔を出して、大概夕方の四時頃には帰って来ます。"
## [4] "昔は非常な勤勉家で、朝は早起きの方でしたけれども、この頃の私は、九時半か十時でなければ起きません。起きると直ぐに、寝間着のまま、そっと爪先(つまさき)で歩きながら、ナオミの寝室の前へ行って、静かに扉をノックします。しかしナオミは私以上に寝坊ですから、まだその時分は夢現(ゆめうつつ)で、"
## [5] "「ふん」"
## [6] "と、微(かす)かに答える時もあり、知らずに寝ている時もあります。答があれば私は部屋へ這入って行って挨拶(あいさつ)をし、答がなければ扉の前から引き返して、そのまま事務所へ出かけるのです。"
## [7] "こう云(い)う風に、私たち夫婦はいつの間にか、別々の部屋に寝るようになっているのですが、もとはと云うと、これはナオミの発案でした。婦人の閨房(けいぼう)は神聖なものである、夫といえども妄(みだ)りに犯すことはならない、―――と、彼女は云って、広い方の部屋を自分が取り、その隣りにある狭い方のを私の部屋にあてがいました。そうして隣り同士とは云っても、二つの部屋は直接つながってはいないのでした。その間に夫婦専用の浴室と便所が挟まっている、つまりそれだけ、互に隔たっている訳で、一方の室から一方へ行くには、そこを通り抜けなければなりません。"
## [8] "ナオミは毎朝十一時過ぎまで、起きるでもなく睡(ねむ)るでもなく、寝床の中でうつらうつらと、煙草(たばこ)を吸ったり新聞を読んだりしています。煙草はディミトリノの細巻、新聞は都新聞、それから雑誌のクラシックやヴォーグを読みます。いや読むのではなく、中の写真を、―――主に洋服の意匠や流行を、―――一枚々々丁寧に眺めています。その部屋は東と南が開いて、ヴェランダの下に直ぐ本牧の海を控え、朝は早くから明るくなります。ナオミの寝台は、日本間ならば二十畳も敷けるくらいな、広い室(へや)の中央に据えてあるのですが、それも普通の安い寝台ではありません。或(あ)る東京の大使館から売り物に出た、天蓋(てんがい)の附いた、白い、紗(しゃ)のような帳(とばり)の垂れている寝台で、これを買ってから、ナオミは一層寝心地がよいのか、前よりもなお床離れが悪くなりました。"
## [9] "彼女は顔を洗う前に、寝床で紅茶とミルクを飲みます。その間にアマが風呂場(ふろば)の用意をします。彼女は起きて、真っ先に風呂へ這入り、湯上りの体を又暫(しばら)く横たえながら、マッサージをさせます。それから髪を結い、爪(つめ)を研(みが)き、七つ道具と云いますが中々七つどころではない、何十種とある薬や器具で顔じゅうをいじくり廻し、着物を着るのにあれかこれかと迷った上で、食堂へ出るのが大概一時半になります。"
## [10] "午(ひる)飯をたべてしまってから、晩まで殆ど用はありません。晩にはお客に呼ばれるか、或(あるい)は呼ぶか、それでなければホテルへダンスに出かけるか、何かしないことはないのですから、その時分になると、彼女はもう一度お化粧をし、着物を取り換えます。夜会がある時は殊(こと)に大変で、風呂場へ行って、アマに手伝わせて、体じゅうへお白粉(しろい)を塗ります。"
## [11] "ナオミの友達はよく変りました。浜田や熊谷はあれからふッつり出入りをしなくなってしまって、一と頃は例のマッカネルがお気に入りのようでしたが、間もなく彼に代った者は、デュガンと云う男でした。デュガンの次には、ユスタスと云う友達が出来ました。このユスタスと云う男は、マッカネル以上に不愉快な奴で、ナオミの御機嫌を取ることが実に上手で、一度私は、腹立ち紛れに、舞蹈会(ぶとうかい)の時此奴(こいつ)を打(ぶ)ん殴(なぐ)ったことがあります。すると大変な騒ぎになって、ナオミはユスタスの加勢をして「気違い!」と云って私を罵(ののし)る。私はいよいよ猛(たけ)り狂って、ユスタスを追い廻す。みんなが私を抱き止めて「ジョージ! ジョージ!」と大声で叫ぶ。―――私の名前は譲治ですが、西洋人は George の積りで「ジョージ」「ジョージ」と呼ぶのです。―――そんなことから、結局ユスタスは私の家へ来ないようになりましたが、同時に私も、又ナオミから新しい条件を持ち出され、それに服従することになってしまいました。"
## [12] "ユスタスの後にも、第二第三のユスタスが出来たことは勿論(もちろん)ですが、今では私は、我ながら不思議に思うくらい大人しいものです。人間と云うものは一遍恐ろしい目に会うと、それが強迫観念になって、いつまでも頭に残っていると見え、私は未(いま)だに、嘗(かつ)てナオミに逃げられた時の、あの恐ろしい経験を忘れることが出来ないのです。「あたしの恐ろしいことが分ったか」と、そう云った彼女の言葉が、今でも耳にこびり着いているのです。彼女の浮気と我が儘(まま)とは昔から分っていたことで、その欠点を取ってしまえば彼女の値打ちもなくなってしまう。浮気な奴だ、我が儘な奴だと思えば思うほど、一層可愛(かわい)さが増して来て、彼女の罠(わな)に陥ってしまう。ですから私は、怒れば尚更(なおさら)自分の負けになることを悟っているのです。"
## [13] "自信がなくなると仕方がないもので、目下の私は、英語などでも到底彼女には及びません。実地に附き合っているうちに自然と上達したのでしょうが、夜会の席で婦人や紳士に愛嬌(あいきょう)を振りまきながら、彼女がぺらぺらまくし立てるのを聞いていると、何しろ発音は昔から巧(うま)かったのですから、変に西洋人臭くって、私には聞きとれないことがよくあります。そうして彼女は、ときどき私を西洋流に「ジョージ」と呼びます。"
## [14] "これで私たち夫婦の記録は終りとします。これを読んで、馬鹿々々(ばかばか)しいと思う人は笑って下さい。教訓になると思う人は、いい見せしめにして下さい。私自身は、ナオミに惚(ほ)れているのですから、どう思われても仕方がありません。"
## [15] "ナオミは今年二十三で私は三十六になります。"
## [16] ""
## [17] ""
## [18] ""
## [19] ""
## [20] ""
## [21] ""
## [22] ""
## [23] "底本:「痴人の愛」新潮文庫、新潮社"
## [24] " 1947(昭和22)年11月10日発行"
## [25] " 2003(平成15)年6月10日116刷改版"
## [26] " 2011(平成23)年2月10日126刷"
## [27] "初出:「大阪朝日新聞」"
## [28] " 1924(大正13)年3月〜5月"
## [29] " 「女性」"
## [30] " 1924(大正13)年11月〜1925(大正14)年7月"
## [31] "※底本巻末の細江光氏による注解は省略しました。"
## [32] "入力:daikichi"
## [33] "校正:悠悠自炊"
## [34] "2017年6月25日作成"
## [35] "青空文庫作成ファイル:"
## [36] "このファイルは、インターネットの図書館、青空文庫(http://www.aozora.gr.jp/)で作られました。入力、校正、制作にあたったのは、ボランティアの皆さんです。"
## [37] ""
## [38] ""
## [39] ""
## [40] ""
## [41] ""
## [42] ""
## [43] "●表記について"
## [44] ""
## [45] "\tこのファイルは W3C 勧告 XHTML1.1 にそった形式で作成されています。"
## [46] "\t「くの字点」をのぞくJIS X 0213にある文字は、画像化して埋め込みました。"
## [47] ""
## [48] ""
## [49] ""
## [50] ""
## [51] ""
## [52] "●図書カード"
## [53] ""
## [54] ""
## [55] ""
The novel begins with line 26 (unless we wanted the author and title) and ends 16 lines after 2450, so we’ll subset the text as follows, using head and tail to confirm our choices
chijin <- chijin_xml[26:(2450+16)]
head(chijin)
## [1] "私はこれから、あまり世間に類例がないだろうと思われる私達夫婦の間柄に就いて、出来るだけ正直に、ざっくばらんに、有りのままの事実を書いて見ようと思います。それは私自身に取って忘れがたない貴い記録であると同時に、恐らくは読者諸君に取っても、きっと何かの参考資料となるに違いない。殊(こと)にこの頃のように日本もだんだん国際的に顔が広くなって来て、内地人と外国人とが盛んに交際する、いろんな主義やら思想やらが這入(はい)って来る、男は勿論(もちろん)女もどしどしハイカラになる、と云(い)うような時勢になって来ると、今まではあまり類例のなかった私たちの如(ごと)き夫婦関係も、追い追い諸方に生じるだろうと思われますから。"
## [2] "考えて見ると、私たち夫婦は既にその成り立ちから変っていました。私が始めて現在の私の妻に会ったのは、ちょうど足かけ八年前のことになります。尤(もっと)も何月の何日だったか、委(くわ)しいことは覚えていませんが、とにかくその時分、彼女は浅草の雷門の近くにあるカフエエ・ダイヤモンドと云う店の、給仕女をしていたのです。彼女の歳(とし)はやっと数え歳の十五でした。だから私が知った時はまだそのカフエエへ奉公に来たばかりの、ほんの新米だったので、一人前の女給ではなく、それの見習い、―――まあ云って見れば、ウエイトレスの卵に過ぎなかったのです。"
## [3] "そんな子供をもうその時は二十八にもなっていた私が何で眼をつけたかと云うと、それは自分でもハッキリとは分りませんが、多分最初は、その児(こ)の名前が気に入ったからなのでしょう。彼女はみんなから「直ちゃん」と呼ばれていましたけれど、或(あ)るとき私が聞いて見ると、本名は奈緒美(なおみ)と云うのでした。この「奈緒美」という名前が、大変私の好奇心に投じました。「奈緒美」は素敵だ、NAOMI と書くとまるで西洋人のようだ、と、そう思ったのが始まりで、それから次第に彼女に注意し出したのです。不思議なもので名前がハイカラだとなると、顔だちなども何処(どこ)か西洋人臭く、そうして大そう悧巧(りこう)そうに見え、「こんな所の女給にして置くのは惜しいもんだ」と考えるようになったのです。"
## [4] "実際ナオミの顔だちは、(断って置きますが、私はこれから彼女の名前を片仮名で書くことにします。どうもそうしないと感じが出ないのです)活動女優のメリー・ピクフォードに似たところがあって、確かに西洋人じみていました。これは決して私のひいき眼ではありません。私の妻となっている現在でも多くの人がそう云うのですから、事実に違いないのです。そして顔だちばかりでなく、彼女を素っ裸にして見ると、その体つきが一層西洋人臭いのですが、それは勿論後になってから分ったことで、その時分には私もそこまでは知りませんでした。ただおぼろげに、きっとああ云うスタイルなら手足の恰好(かっこう)も悪くはなかろうと、着物の着こなし工合から想像していただけでした。"
## [5] "一体十五六の少女の気持と云うものは、肉親の親か姉妹ででもなければ、なかなか分りにくいものです。だからカフエエにいた頃のナオミの性質がどんなだったかと云われると、どうも私には明瞭(めいりょう)な答えが出来ません。恐らくナオミ自身にしたって、あの頃はただ何事も夢中で過したと云うだけでしょう。が、ハタから見た感じを云えば、孰方(どっち)かと云うと、陰鬱(いんうつ)な、無口な児のように思えました。顔色なども少し青みを帯びていて、譬(たと)えばこう、無色透明な板ガラスを何枚も重ねたような、深く沈んだ色合をしていて、健康そうではありませんでした。これは一つにはまだ奉公に来たてだったので、外の女給のようにお白粉(しろい)もつけず、お客や朋輩(ほうばい)にも馴染(なじみ)がうすく、隅の方に小さくなって黙ってチョコチョコ働いていたものだから、そんな風に見えたのでしょう。そして彼女が悧巧そうに感ぜられたのも、やっぱりそのせいだったかも知れません。"
## [6] "ここで私は、私自身の経歴を説明して置く必要がありますが、私は当時月給百五十円を貰(もら)っている、或る電気会社の技師でした。私の生れは栃木県の宇都宮在で、国の中学校を卒業すると東京へ来て蔵前(くらまえ)の高等工業へ這入り、そこを出てから間もなく技師になったのです。そして日曜を除く外は、毎日芝口の下宿屋から大井町の会社へ通っていました。"
tail(chijin)
## [1] "ユスタスの後にも、第二第三のユスタスが出来たことは勿論(もちろん)ですが、今では私は、我ながら不思議に思うくらい大人しいものです。人間と云うものは一遍恐ろしい目に会うと、それが強迫観念になって、いつまでも頭に残っていると見え、私は未(いま)だに、嘗(かつ)てナオミに逃げられた時の、あの恐ろしい経験を忘れることが出来ないのです。「あたしの恐ろしいことが分ったか」と、そう云った彼女の言葉が、今でも耳にこびり着いているのです。彼女の浮気と我が儘(まま)とは昔から分っていたことで、その欠点を取ってしまえば彼女の値打ちもなくなってしまう。浮気な奴だ、我が儘な奴だと思えば思うほど、一層可愛(かわい)さが増して来て、彼女の罠(わな)に陥ってしまう。ですから私は、怒れば尚更(なおさら)自分の負けになることを悟っているのです。"
## [2] "自信がなくなると仕方がないもので、目下の私は、英語などでも到底彼女には及びません。実地に附き合っているうちに自然と上達したのでしょうが、夜会の席で婦人や紳士に愛嬌(あいきょう)を振りまきながら、彼女がぺらぺらまくし立てるのを聞いていると、何しろ発音は昔から巧(うま)かったのですから、変に西洋人臭くって、私には聞きとれないことがよくあります。そうして彼女は、ときどき私を西洋流に「ジョージ」と呼びます。"
## [3] "これで私たち夫婦の記録は終りとします。これを読んで、馬鹿々々(ばかばか)しいと思う人は笑って下さい。教訓になると思う人は、いい見せしめにして下さい。私自身は、ナオミに惚(ほ)れているのですから、どう思われても仕方がありません。"
## [4] "ナオミは今年二十三で私は三十六になります。"
## [5] ""
## [6] ""
Alternatively, you can download the compressed zip file from the Aozora Bunko and read in the resulting txt file. We’ll also peek at the header and footer, and keep just the novel. The index numbers are slightly different, but the process is the same.
chijin_txt <- read_lines("chijinno_ai.txt",
locale = locale(encoding = "SHIFT-JIS"))
chijin <- chijin_txt[20:2460] ## keep the novel, drop the metadata
However we generated the object chijin, we should make use of the chapter break markers. In the Aozora Bunko text, chapter breaks begin with [#5字下げ], but that could get garbled in tokenization, so we’ll use regex to replace those lines with a distinctive marker. The regex string below uses ^ to mark the beginning of the line and $ to mark the end of the line, and escapes the square brackets, which are otherwise special characters in regex. The search string "^\\[#5字下げ\\].*$" grabs all of every line beginning with [#5字下げ]. For replacement, I’ve chosen the arbitrary but distinctive string CHAPTERZZZZZ, but the exact string doesn’t matter, so long as it’s distinctive and won’t be confused with the text itself. Thus, you could also use NONSENSE or CHUNK or XYLOPHONE.
chijin <- str_replace(chijin, "^\\[#5字下げ\\].*$", "CHAPTERZZZZZ") ## brackets escaped: they are regex special characters
Now let’s save the file for upload to NINJAL. Specify the folder or directory location as desired.
writeLines(chijin, "new_chijin.txt") ## write the file for upload
Now we’ll process the text using the NINJAL tokenizer http://chamame.ninjal.ac.jp/. Choose csv and UTF-8 output
By default, the website includes many linguistic details
Those default settings will produce a large file, but some of it is useful. The 品詞 column will give the general part of speech (e.g., 名詞、助動詞), while 書字形.基本形 gives the root form of verbs, converting, for example, 思い to 思う. By contrast, I have not yet found a use for 発音. Select or deselect options as desired.
You should also select the appropriate tokenizer. For Chijin no ai the logical choice is 現代語.
Now process the text file by hitting ファイルから解析. When the file downloads, it may automatically open in Excel. Ignore that. We’ll work in R.
We can open the file using the Import Dataset pulldown in RStudio (in the top right pane); From Text (base) is good.
Deselect “Strings as Factors” and give the file an appropriate name. Then hit “Open”.
If you want to do this as a script, it will look something like this
Tanizaki <- read.csv("webchamame_20190414234619.csv", comment.char="#", stringsAsFactors=FALSE)
For word frequency analysis we can use either the raw form of the tokens (書字形..表層形.) or the stemmed form (書字形.基本形). It’s probably best to let RStudio auto-complete those names.
It’s easy to turn a vector of individual observations (tokens) into a frequency table. First, we’ll create a frequency table using table and then convert that to a data frame with data.frame.
Next, we’ll sort the data frame with the order command. The syntax of order is almost identical to subsetting: if we wanted the tenth row and all the columns of Tanizaki_words.df, we’d write Tanizaki_words.df[10,]. But rather than subset, we want to order the rows of the data frame by frequency, ranked high to low. Therefore, instead of Tanizaki_words.df[10,], we write Tanizaki_words.df[order(-Tanizaki_words.df$Freq),], where the minus sign reverses the default low-to-high sort. So, here are the 50 most common raw tokens in Tanizaki’s Naomi (痴人の愛):
Tanizaki_words.df <- data.frame(table(Tanizaki$書字形..表層形.))
Tanizaki_words.df <- Tanizaki_words.df[order(-Tanizaki_words.df$Freq),]
Tanizaki_words.df[1:50,]
## Var1 Freq
## 27 、 8141
## 1890 の 5644
## 1842 に 4453
## 1634 て 4171
## 1905 は 3677
## 1383 た 3397
## 2686 を 3000
## 1675 と 2865
## 48 《 2825
## 49 》 2825
## 527 が 2737
## 39 。 2502
## 50 「 2192
## 52 」 2164
## 1635 で 2114
## 2456 も 1774
## 1056 し 1507
## 526 か 1174
## 1772 ない 1163
## 1385 だ 1115
## 6870 私 1090
## 38 … 1084
## 1768 な 1038
## 637 から 994
## 2687 ん 932
## 232 い 845
## 2321 まし 789
## 1779 ナオミ 702
## 1653 です 682
## 1338 そう 669
## 2558 よう 645
## 2953 云う 606
## 1372 それ 601
## 2184 へ 592
## 2555 よ 583
## 317 いる 560
## 1650 でし 542
## 945 こと 540
## 4819 彼女 533
## 1361 その 531
## 33 ? 512
## 46 ] 488
## 44 [ 487
## 56 # 487
## 3242 傍点 438
## 1052 さん 369
## 1286 する 353
## 3145 何 339
## 2327 ます 336
## 1139 じゃ 334
And here are the 50 most common kanji terms, that is, tokens containing at least one kanji. We’ll use str_detect (which returns a binary TRUE or FALSE) and the regex string \\p{Han}. Then we’ll wrap str_detect in which to get the index numbers, and use those to subset the data frame.
Tanizaki_kanji.df <- Tanizaki_words.df[which(str_detect(Tanizaki_words.df$Var1,"\\p{Han}")),]
Tanizaki_kanji.df[1:50,]
## Var1 Freq
## 6870 私 1090
## 2953 云う 606
## 4819 彼女 533
## 3242 傍点 438
## 3145 何 339
## 5795 来 309
## 2690 一 308
## 5560 方 287
## 2957 云っ 277
## 5643 時 229
## 2982 人 193
## 3260 僕 193
## 4259 女 165
## 7364 自分 161
## 6402 熊谷 148
## 2942 云い 147
## 7637 見 147
## 6039 気 144
## 6195 浜田 140
## 522 お前 138
## 2862 中 138
## 5798 来る 134
## 8588 顔 130
## 8391 間 129
## 3005 今 125
## 4410 家 123
## 7561 行っ 123
## 7873 譲治 122
## 6580 男 121
## 2923 事 120
## 3520 前 118
## 5578 日 117
## 4595 己 110
## 2930 二 106
## 4945 思っ 104
## 3846 君 96
## 7663 見る 96
## 6765 眼 94
## 7629 西洋 94
## 3609 十 93
## 3413 出 92
## 3151 何処 90
## 4692 度 90
## 2763 上 85
## 5179 所 85
## 2936 二人 84
## 3684 又 81
## 5185 手 79
## 2958 云わ 77
## 3418 出し 77
Even this rudimentary analysis is suggestive: the text is full of first and second person pronouns and proper names.
We can also subset the data frame by part of speech, using the 品詞 column, and count by root form.
Tanizaki_verbs.df <- Tanizaki[which(Tanizaki$品詞=="動詞-一般"),]
Tanizaki_verbs.df <- data.frame(table(Tanizaki_verbs.df$書字形.基本形.))
Tanizaki_verbs.df <- Tanizaki_verbs.df[order(-Tanizaki_verbs.df$Freq),]
Tanizaki_verbs.df[1:20,]
## Var1 Freq
## 273 云う 1141
## 609 思う 248
## 345 出る 122
## 351 分る 122
## 969 知る 103
## 406 取る 90
## 545 帰る 79
## 699 持つ 74
## 1008 立つ 69
## 1051 聞く 68
## 518 寝る 62
## 970 知れる 56
## 1216 踊る 56
## 1041 考える 52
## 1106 見える 52
## 962 着る 49
## 833 歩く 47
## 1011 笑う 44
## 1253 這入る 44
## 344 出かける 39
Here too, basic word frequency gives us a sense of the novel. The verb 云う is the most common and, indeed, much of the novel occurs in dialogue. We can calculate the preponderance of 云う as a percentage of total verbs in the novel:
Tanizaki_verbs.df$Freq[which(Tanizaki_verbs.df$Var1=="云う")]/
sum(Tanizaki_verbs.df$Freq)*100
## [1] 16.09082
As a follow-up question (and a model of code re-use), let’s run the exact same analysis on proper nouns. To capture both people (人名) and places (地名), we’ll use str_detect to look for “名詞-固有名詞-”, which matches both “名詞-固有名詞-人名” and “名詞-固有名詞-地名”.
Tanizaki_names.df <- Tanizaki[which(str_detect(Tanizaki$品詞, "名詞-固有名詞-")),]
Tanizaki_names.df <- data.frame(table(Tanizaki_names.df$書字形.基本形.))
Tanizaki_names.df <- Tanizaki_names.df[order(-Tanizaki_names.df$Freq),]
Tanizaki_names.df[1:10,]
## Var1 Freq
## 81 ナオミ 702
## 255 熊谷 148
## 244 浜田 140
## 297 譲治 122
## 185 大森 48
## 238 河合 43
## 219 日本 42
## 308 鎌倉 32
## 228 東京 17
## 16 うま 16
Not surprisingly, the most common proper noun is Naomi.
Now let’s visualize the relative frequency of verbs in Naomi. To lock the words into order of frequency, we’ll create a factor. We’ve neglected factors thus far, except to avoid them with stringsAsFactors=FALSE. Factors are strings with underlying numeric values. “Freshman,” “sophomore,” “junior,” “senior” and “first,” “second,” “third” are examples of strings that would logically make good factors. Because factors sit between strings and numbers, they are both useful and confusing: it’s easy to grab the wrong aspect of a factor and get a number when you want a word.
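A toy example (hypothetical data, not from the novel) shows the two faces of a factor:
years <- factor(c("junior", "freshman", "senior", "freshman"),
    levels = c("freshman", "sophomore", "junior", "senior"))
as.numeric(years)   ## the underlying codes: 3 1 4 1
as.character(years) ## the strings themselves
With that caveat, let’s create a new variable: the verbs in Naomi in order of frequency.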
Tanizaki_verbs.df$ordered <- reorder(Tanizaki_verbs.df$Var1, Tanizaki_verbs.df$Freq)
Now we’ll make use of that new variable in ggplot. We’ll subset the data to the top 20 verbs within the ggplot command, and use coord_flip() to make the bars horizontal.
ggplot(Tanizaki_verbs.df[c(1:20),], aes(ordered,Freq)) +
geom_col() +
coord_flip() +
theme_gray(base_family = "Osaka") +
ggtitle("Verb frequency in Naomi") +
xlab("Verbs") +
ylab("Frequency")
Finally, let’s look at how word use varies over time within the novel. We’ll smash all the tokens into one long string, and then split it apart at our arbitrary chapter tag. Since the tokenizer may have broken CHAPTERZZZZZ into pieces, we split on the CHAPTER portion, which survives intact either way.
Tanizaki_text <- paste(Tanizaki$書字形..表層形., collapse = " ")
Tanizaki_chapters <- data.frame(t(str_split(Tanizaki_text, "CHAPTER", simplify = TRUE)))
colnames(Tanizaki_chapters) <- "text"
Tanizaki_chapters$number <- as.numeric(row.names(Tanizaki_chapters))
Now let’s count some strings and graph the counts, concluding with a more colorful visualization of “chapter time.” First, a scatterplot: we’ll specify both the data frame and the variables (inside aes) in the ggplot command.
Tanizaki_chapters$日本 <- str_count(Tanizaki_chapters$text,"日本")
Tanizaki_chapters$西洋 <- str_count(Tanizaki_chapters$text,"西洋")
ggplot(Tanizaki_chapters, aes(西洋,日本)) +
geom_point() +
theme_grey(base_family = "Osaka")
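The correlation coefficient itself is one line of base R (cor computes the Pearson correlation by default):
cor(Tanizaki_chapters$西洋, Tanizaki_chapters$日本) ## roughly 0.66 for this text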
Mentions of “the West” and “Japan” correlate rather strongly. The formal metric (a correlation coefficient of 0.66) matches what the scatterplot suggests: Tanizaki discussed “the West” and “Japan” together. We can also graph word use over “chapter time,” as below.
How do you make a “chapter time” graph? That’s a topic for a subsequent workshop.
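In the meantime, here is one possible sketch (not necessarily the method behind the graph above): reshape the per-chapter counts into long format, here with the tidyr package, and draw one colored line per term.
library(tidyr)
Tanizaki_long <- pivot_longer(Tanizaki_chapters[, c("number", "日本", "西洋")],
    cols = c("日本", "西洋"),
    names_to = "term", values_to = "count")
ggplot(Tanizaki_long, aes(number, count, color = term)) +
  geom_line() + ## one line per term, colored by term
  theme_grey(base_family = "Osaka") +
  xlab("Chapter") +
  ylab("Frequency")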
For reference, here are some regex expressions that are especially useful for Japanese text, followed by tables of general-purpose character classes and anchors.

expression | meaning | example |
---|---|---|
\\p{Hiragana} | Hiragana | ぁ あ ぃ い ぅ う ぇ え ぉ お か が き ぎ く |
\\p{Katakana} | Katakana (Full Width) | ァ ア ィ イ ゥ ウ ェ エ ォ オ |
\\p{Han} | Kanji | 漢字 日本語 文字 言語 言葉 |
[\u3000-\u303F] | Japanese Symbols and Punctuation | 。 〃 〄 々 〆 〇 〈 〉 《 》 「 」 |
[\uFF5F-\uFF9F] | Katakana and Punctuation (Half Width) | ⦅ ⦆ 。 「 」 、 ・ ヲ ァ ィ ゥ ェ ォ ャ |
expression | meaning |
---|---|
\\s | White space |
\\S | Not white space |
\\w | Word characters, e.g. letters, numbers and underscores |
\\W | Non-word characters, such as the spaces and punctuation between words |
\\d | Digits |
\\D | Non-digits |
\\b | Word boundaries — almost the same as non-word characters, but includes the beginning and end of lines |
\\B | Negation of ‘word boundaries’: any position between two word characters as well as at any position between two non-word characters |
expression | meaning |
---|---|
^ | Start of line |
$ | End of line |
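A few quick illustrations of these expressions in action, using toy strings:
str_detect("ナオミ", "\\p{Katakana}")  ## TRUE
str_detect("ナオミ", "\\p{Hiragana}")  ## FALSE
str_extract_all("私は東京へ行く", "\\p{Han}+") ## "私" "東京" "行"
str_detect("word boundary", "\\bword\\b") ## TRUE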
If you scrape directly from a web page, the results are likely to be full of html tags. Now that we know regex, we can remove those easily. Let’s try the Aozora bunko edition of Futabatei Shimei’s Ukigumo. R does a good job of recognizing UTF-8 encoding, but the Aozora bunko is encoded in Shift-JIS. Fortunately, the readr package helpfully allows us to specify the encoding. The read_lines command in readr is almost the same as the readLines command in base R.
library(readr)
messy.ukigumo <- read_lines("https://www.aozora.gr.jp/cards/000006/files/1869_33656.html", locale = locale(encoding = "SHIFT_JIS"))
messy.ukigumo[20:25]
## [1] "<br />"
## [2] " 浮雲はしがき<br />"
## [3] "<br />"
## [4] "<div class=\"jisage_2\" style=\"margin-left: 2em\">"
## [5] " <ruby><rb>薔薇</rb><rp>(</rp><rt>ばら</rt><rp>)</rp></ruby>の花は<ruby><rb>頭</rb><rp>(</rp><rt>かしら</rt><rp>)</rp></ruby>に咲て活人は絵となる世の中独り文章<ruby><rb>而已</rb><rp>(</rp><rt>のみ</rt><rp>)</rp></ruby>は<ruby><rb>黴</rb><rp>(</rp><rt>かび</rt><rp>)</rp></ruby>の生えた<ruby><rb>陳奮翰</rb><rp>(</rp><rt>ちんぷんかん</rt><rp>)</rp></ruby>の四角張りたるに<ruby><rb>頬返</rb><rp>(</rp><rt>ほおがえ</rt><rp>)</rp></ruby>しを附けかね又は舌足らずの<ruby><rb>物言</rb><rp>(</rp><rt>ものいい</rt><rp>)</rp></ruby>を学びて口に<ruby><rb>涎</rb><rp>(</rp><rt>よだれ</rt><rp>)</rp></ruby>を流すは<ruby><rb>拙</rb><rp>(</rp><rt>つたな</rt><rp>)</rp></ruby>しこれはどうでも言文<ruby><rb>一途</rb><rp>(</rp><rt>いっと</rt><rp>)</rp></ruby>の事だと思立ては矢も<ruby><rb>楯</rb><rp>(</rp><rt>たて</rt><rp>)</rp></ruby>もなく文明の風改良の熱一度に寄せ来るどさくさ紛れお先<ruby><rb>真闇</rb><rp>(</rp><rt>まっくら</rt><rp>)</rp></ruby><ruby><rb>三宝荒神</rb><rp>(</rp><rt>さんぽうこうじん</rt><rp>)</rp></ruby>さまと春のや先生を頼み<ruby><rb>奉</rb><rp>(</rp><rt>たてまつ</rt><rp>)</rp></ruby>り<ruby><rb>欠硯</rb><rp>(</rp><rt>かけすずり</rt><rp>)</rp></ruby>に<ruby><rb>朧</rb><rp>(</rp><rt>おぼろ</rt><rp>)</rp></ruby>の月の<ruby><rb>雫</rb><rp>(</rp><rt>しずく</rt><rp>)</rp></ruby>を受けて墨<ruby><rb>摺流</rb><rp>(</rp><rt>すりなが</rt><rp>)</rp></ruby>す空のきおい夕立の雨の一しきりさらさらさっと書流せばアラ<ruby><rb>無情</rb><rp>(</rp><rt>うたて</rt><rp>)</rp></ruby>始末にゆかぬ浮雲めが<ruby><rb>艶</rb><rp>(</rp><rt>やさ</rt><rp>)</rp></ruby>しき月の面影を思い<ruby><rb>懸</rb><rp>(</rp><rt>がけ</rt><rp>)</rp></ruby>なく<ruby><rb>閉籠</rb><rp>(</rp><rt>とじこめ</rt><rp>)</rp></ruby>て<ruby><rb>黒白</rb><rp>(</rp><rt>あやめ</rt><rp>)</rp></ruby>も分かぬ<ruby><rb>烏夜玉</rb><rp>(</rp><rt>うばたま</rt><rp>)</rp></ruby>のやみらみっちゃな小説が出来しぞやと我ながら肝を<ruby><rb>潰</rb><rp>(</rp><rt>つぶ</rt><rp>)</rp></ruby>してこの書の巻端に序するものは<br />"
## [6] "<br />"
messy.ukigumo[50:55]
## [1] "「何故」<br />"
## [2] "「何故と言って、彼奴は馬鹿だ、課長に向って<ruby><rb>此間</rb><rp>(</rp><rt>こないだ</rt><rp>)</rp></ruby>のような事を言う所を見りゃア、<ruby><rb>弥</rb><rp>(</rp><rt>いよいよ</rt><rp>)</rp></ruby>馬鹿だ」<br />"
## [3] "「あれは全体課長が悪いサ、自分が不条理な事を言付けながら、何にもあんなに頭ごなしにいうこともない」<br />"
## [4] "「それは課長の方が或は不条理かも知れぬが、しかし<ruby><rb>苟</rb><rp>(</rp><rt>いやしく</rt><rp>)</rp></ruby>も長官たる者に向って抵抗を試みるなぞというなア、馬鹿の骨頂だ。まず考えて見給え、山口は何んだ、属吏じゃアないか。属吏ならば、<ruby><rb>仮令</rb><rp>(</rp><rt>たと</rt><rp>)</rp></ruby>い課長の言付を条理と思ったにしろ思わぬにしろ、ハイハイ言ってその通り<ruby><rb>処弁</rb><rp>(</rp><rt>しょべん</rt><rp>)</rp></ruby>して往きゃア、職分は尽きてるじゃアないか。<ruby><rb>然</rb><rp>(</rp><rt>しか</rt><rp>)</rp></ruby>るに彼奴のように、苟も課長たる者に向ってあんな差図がましい事を……」<br />"
## [5] "「イヤあれは指図じゃアない、注意サ」<br />"
## [6] "「フム<ruby><rb>乙</rb><rp>(</rp><rt>おつ</rt><rp>)</rp></ruby>う山口を弁護するネ、やっぱり同病<ruby><rb>相憐</rb><rp>(</rp><rt>あいあわ</rt><rp>)</rp></ruby>れむのか、アハアハアハ」<br />"
To remove (almost) all of the html, we’ll use the regex string "<.*?>", which grabs everything between the angle brackets < and >. The question mark makes the quantifier lazy, so it matches as little as possible: each tag is matched separately, rather than everything from the first < on a line to the last >.
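You can see the difference on a toy string (a quick illustration, not part of the cleaning pipeline):
str_replace_all("<b>浮雲</b>", "<.*>", "")  ## greedy: matches the whole string, leaving ""
str_replace_all("<b>浮雲</b>", "<.*?>", "") ## lazy: matches only the tags, leaving "浮雲"
Now apply the lazy pattern to the whole text.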
clean.ukigumo <- str_replace_all(string = messy.ukigumo, pattern = "<.*?>",replacement = "")
clean.ukigumo[20:25]
## [1] ""
## [2] " 浮雲はしがき"
## [3] ""
## [4] ""
## [5] " 薔薇(ばら)の花は頭(かしら)に咲て活人は絵となる世の中独り文章而已(のみ)は黴(かび)の生えた陳奮翰(ちんぷんかん)の四角張りたるに頬返(ほおがえ)しを附けかね又は舌足らずの物言(ものいい)を学びて口に涎(よだれ)を流すは拙(つたな)しこれはどうでも言文一途(いっと)の事だと思立ては矢も楯(たて)もなく文明の風改良の熱一度に寄せ来るどさくさ紛れお先真闇(まっくら)三宝荒神(さんぽうこうじん)さまと春のや先生を頼み奉(たてまつ)り欠硯(かけすずり)に朧(おぼろ)の月の雫(しずく)を受けて墨摺流(すりなが)す空のきおい夕立の雨の一しきりさらさらさっと書流せばアラ無情(うたて)始末にゆかぬ浮雲めが艶(やさ)しき月の面影を思い懸(がけ)なく閉籠(とじこめ)て黒白(あやめ)も分かぬ烏夜玉(うばたま)のやみらみっちゃな小説が出来しぞやと我ながら肝を潰(つぶ)してこの書の巻端に序するものは"
## [6] ""
clean.ukigumo[50:55]
## [1] "「何故」"
## [2] "「何故と言って、彼奴は馬鹿だ、課長に向って此間(こないだ)のような事を言う所を見りゃア、弥(いよいよ)馬鹿だ」"
## [3] "「あれは全体課長が悪いサ、自分が不条理な事を言付けながら、何にもあんなに頭ごなしにいうこともない」"
## [4] "「それは課長の方が或は不条理かも知れぬが、しかし苟(いやしく)も長官たる者に向って抵抗を試みるなぞというなア、馬鹿の骨頂だ。まず考えて見給え、山口は何んだ、属吏じゃアないか。属吏ならば、仮令(たと)い課長の言付を条理と思ったにしろ思わぬにしろ、ハイハイ言ってその通り処弁(しょべん)して往きゃア、職分は尽きてるじゃアないか。然(しか)るに彼奴のように、苟も課長たる者に向ってあんな差図がましい事を……」"
## [5] "「イヤあれは指図じゃアない、注意サ」"
## [6] "「フム乙(おつ)う山口を弁護するネ、やっぱり同病相憐(あいあわ)れむのか、アハアハアハ」"
Aside from removing the metadata at the head and foot of the file, the text is ready to go.
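As with Chijin no ai, peek at the head and tail to find where the novel begins and ends, then subset. A sketch (the indices below are illustrative only, so inspect your own file before trusting them):
head(clean.ukigumo, 30) ## find where the Aozora header ends
tail(clean.ukigumo, 40) ## find where the closing bibliographic notes begin
ukigumo <- clean.ukigumo[21:length(clean.ukigumo)] ## hypothetical start index; adjust after inspection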