Under the teaching plan for semester 2 of the 2018-2019 academic year, the Faculty of
Computer Science, in coordination with the Office of Special Programs and
the Office of Undergraduate Academic Affairs, is opening a new course taught by Professor John F. Hurdle (University of Utah/Department of Biomedical Informatics).
Information about Prof. Hurdle's course in the course registration system is as follows:
- Course name: Xử lý văn bản Y khoa (Medical Text Processing)
- Class code: CS339.J21.KHTN
- Elective course in the Computer Science major, accepted for all cohorts and training programs (from K10 onward).
- Time: Tuesday, periods 6-8, in Room E.22
- Eligibility: all students of the university who are able to study in English may register.
- The class will have a Vietnamese lecturer as teaching assistant (Dr. Nguyễn Lưu Thùy Ngân)
Main course content:
Course Overview:
- Natural language processing (NLP) and machine learning (ML) form the foundation of the essential online tools we use every day, such as
Google’s extraordinary search engine; Netflix’s “here is a film we think you will like” suggestion
technology; and Amazon’s “what to buy next” recommendation technology. NLP/ML skills are in high
demand in industry globally, so students who know NLP/ML will be well positioned to move ahead in
their careers. This course will introduce the essential theory behind these tools and will stress
applying the theory to real-world problems. These tools are very easy to use poorly, so we
focus on the principled application of these tools.
Course Schedule & Topics:
- Week 1:
Introduce Prof. Hurdle; introduce the
students and other attendees; Review
syllabus and course requirements;
and Survey student and other
attendees’ programming background
and skill sets.
- Week 2:
NLP/ML Bootcamp. The rationale for
using NLP; Text as data; Linguistic
versus statistical approaches to NLP;
How NLP is used in common apps
(e.g., Google, Amazon, etc.); Career
opportunities.
- Week 3:
Our tools: Jupyter Notebook and
NLTK. Introduction to the Jupyter
Notebook system; Introduction to the
Natural Language Tool Kit; Accessing
corpora and related resources.
- Week 4:
The NLP Pipeline and Preprocessing
Stages/Modules. Why pipelines?;
Overview of UIMA and UIMA-AS;
Standard pre-processing of text as
pipeline stages.
- Week 5:
Basics of Information Extraction (IE) in
a Pipeline. The foundation of NLP:
finding discrete information in text
(dictionaries, indexes, and regular
expressions); Named Entity
Recognition; Clinical texts: the Unified
Medical Language System (UMLS); IE
as a feed to ML.
- Week 6:
Evaluation of NLP and ML systems:
performance metrics. The confusion
matrix; Precision, Recall, and F-Score;
Sample size and bias; Problems with
the accuracy measure.
- Week 7: Midterm Exam
- Week 8:
Basics of Information Retrieval (IR)
using a Pipeline. The
foundation of IR: finding a
class of documents in a corpus;
Indexing documents; Building on top
of indexes (Google and Nutch); Bag-of-words and sparse data.
- Week 9:
Application-specific Pipelines: Clinical
text use case. Part-of-Speech (POS)
tagging; Stop words; Dictionary
lookup; Using the UMLS in a pipeline;
Clinical context is important:
FastConText.
- Week 10:
Unsupervised Methods and Text
Annotation: Clinical text use case.
Approaches to annotating text to
measure NLP/ML performance;
Avoiding annotation: unsupervised
clustering and related methods; Brief
introduction to sub-language theory.
In-class: Brief exercise illustrating
the pain of annotating; Using CLUTO
to discover clinical sub-languages.
- Week 11:
Machine Learning Bootcamp: focus on
classification Part 1. (All lecture this
session) Classification defined;
Training and testing; Hyper-spaces
and the algorithms to explore them;
The baseline: logistic regression;
Alternatives to the baseline: Naïve
Bayes (NB), Classification trees and
variations (CT), Support Vector
Machines (SVM), and Neural
Networks (NN).
- Week 12:
Machine Learning Bootcamp: focus on
classification Part 2. (All in-class this
session) A team race: the class breaks up
into Team NB, Team CT, Team SVM,
and Team NN. All teams are given the
same training set; each team tunes
its model and measures
performance on the training set; then
each team is given the test set (no
tuning allowed here).
- Week 13:
TBA. This session will be adapted on-the-ground, tailored to topics the
students want to learn more about.
- Week 14: Troubleshooting Final Projects. Student teams present an overview of their projects and brainstorm with the class on barriers/pitfalls/workarounds.
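
Illustration for the Week 6 topics (the confusion matrix; Precision, Recall, and F-Score; problems with the accuracy measure): the following is a minimal Python sketch, not taken from the course materials. The function names and the toy labels are assumptions made up for this example. It shows how a classifier that always predicts the majority class can score high accuracy on imbalanced data while finding none of the positive cases.

```python
def confusion_counts(gold, predicted, positive=1):
    """Count the four confusion-matrix cells for a binary task."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1  # true positive
        elif p == positive:
            fp += 1  # false positive
        elif g == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from the counts.

    Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical, imbalanced toy data: 5 positive documents out of 100.
gold = [1] * 5 + [0] * 95

# A "classifier" that always predicts the negative (majority) class.
always_negative = [0] * 100

scores = metrics(*confusion_counts(gold, always_negative))
print(scores)
```

In this toy setting accuracy is high only because negative cases dominate; recall and F1 are zero, which is why precision, recall, and F-score are reported alongside (or instead of) accuracy.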