COMP 330 Assignment #5
1 Description
In this assignment, you will be implementing a regularized, logistic regression to classify text documents. The implementation will be in Python, on top of Spark. To handle the large data set that we will be
giving you, it is necessary to use Amazon AWS.
You will be asked to perform three subtasks: (1) data preparation, (2) learning (which will be done via
gradient descent) and (3) evaluation of the learned model.
Note: It is important to complete HW 5 and Lab 5 before you really get going on this assignment. HW
5 will give you an opportunity to try out gradient descent for learning a model, and Lab 5 will give you
some experience with writing efficient NumPy code, both of which will be important for making your A5
experience less challenging!
2 Data
You will be dealing with a data set that consists of around 170,000 text documents and a test/evaluation
data set that consists of 18,700 text documents. All but around 6,000 of these text documents are Wikipedia
pages; the remaining documents are descriptions of Australian court cases and rulings. At the highest level,
your task is to build a classifier that can automatically figure out whether a text document is an Australian
court case.
We have prepared three data sets for your use.
1. The Training Data Set (1.9 GB of text). This is the set you will use to train your logistic regression
model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
2. The Testing Data Set (200 MB of text). This is the set you will use to evaluate your model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
3. The Small Data Set (37.5 MB of text). This is for you to use for training and testing of your model on
a smaller data set:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt
Some Data Details to Be Aware Of. You should download and look at the SmallTrainingData.txt
file before you begin. You’ll see that the contents are sort of a pseudo-XML, where each text document
begins with a <doc id = ... > tag, and ends with </doc>. All documents are contained on a single
line of text.
Note that all of the Australian legal cases begin with something like <doc id = "AU1222" ...>;
that is, the doc id for an Australian legal case always starts with AU. You will be trying to figure out if the
document is an Australian legal case by looking only at the contents of the document.
3 The Tasks
There are three separate tasks that you need to complete to finish the assignment. As usual, it makes
sense to implement these and run them on the small data set before moving to the larger one.
3.1 Task 1
First, you need to write Spark code that builds a dictionary that includes the 20,000 most frequent words
in the training corpus. This dictionary is essentially an RDD that has the word as the key, and the relative
frequency position of the word as the value. For example, the value is zero for the most frequent word, and
19,999 for the least frequent word in the dictionary.
To get credit for this task, give us the frequency position of the words "applicant", "and", "attack",
"protein", and "car". These should be values from 0 to 19,999, or -1 if the word is not in the dictionary
because it is not among the top 20,000 most frequent words.
Note that accomplishing this will require you to use a variant of your A4 solution. If you do not trust
your A4 solution and would like mine, you can post a private request on Piazza.
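A minimal sketch of one way to build the dictionary in PySpark is below. It is not the required approach
(the assignment expects you to adapt your A4 solution); it assumes a very simple notion of a "word"
(lowercase alphabetic tokens), and the S3 path is copied from Section 2, so verify the exact key before
running. All variable names are illustrative only.

    import re
    from pyspark import SparkContext

    sc = SparkContext(appName="A5-Task1")

    # Path copied from Section 2; check the exact S3 key before running.
    corpus = sc.textFile("s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt")

    # Strip the pseudo-XML wrapper so only the document text remains.
    def doc_text(line):
        return line[line.find('>') + 1 : line.rfind('</doc>')]

    non_letter = re.compile('[^a-zA-Z]')
    words = corpus.map(doc_text).flatMap(lambda t: non_letter.sub(' ', t).lower().split())

    # Count every word, keep the 20,000 most frequent, and pair each kept word with its
    # frequency position (0 = most frequent, 19,999 = least frequent kept).
    top = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).top(20000, key=lambda kv: kv[1])
    dictionary = sc.parallelize([(top[i][0], i) for i in range(len(top))])

    # Frequency positions requested for credit; -1 means the word is not in the dictionary.
    lookup = dict(dictionary.collect())
    for w in ["applicant", "and", "attack", "protein", "car"]:
        print(w, lookup.get(w, -1))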
3.2 Task 2
Next, you will convert each of the documents in the training set to a TF-IDF vector. You will then use
a gradient descent algorithm to learn a logistic regression model that can decide whether a document is
describing an Australian court case or not. Your model should use l2 regularization; you can play with
things a bit to determine the parameter controlling the extent of the regularization. We will have enough
data that you might find that the regularization may not be too important (that is, it may be that you get good
results with a very small weight given to the regularization constant).
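To make the TF-IDF step concrete, here is a small, purely illustrative NumPy computation on a toy corpus.
It assumes the common definitions TF = (count of word in document) / (document length) and
IDF = log(number of documents / number of documents containing the word); in the assignment itself each
document would become a length-20,000 vector indexed by the Task 1 dictionary positions, and the
computation would be distributed with Spark.

    import numpy as np

    # Toy corpus and a toy "dictionary" mapping words to positions.
    docs = [["the", "court", "ruled"], ["the", "wiki", "page"], ["the", "court"]]
    vocab = {"the": 0, "court": 1, "ruled": 2, "wiki": 3, "page": 4}

    num_docs = len(docs)
    doc_freq = np.zeros(len(vocab))          # number of documents containing each word
    for d in docs:
        for w in set(d):
            doc_freq[vocab[w]] += 1
    idf = np.log(num_docs / doc_freq)

    tfidf = []
    for d in docs:
        tf = np.zeros(len(vocab))            # per-document word counts
        for w in d:
            tf[vocab[w]] += 1
        tfidf.append(tf / len(d) * idf)      # TF times IDF, one vector per document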
I am going to ask that you not just look up the gradient descent algorithm on the Internet and implement
it. Start with the LLH function from class, and then derive your own gradient descent algorithm. We can
help with this if you get stuck.
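For reference, one common way the pieces fit together (a sketch only; it assumes labels y_i in {0, 1},
TF-IDF vectors x_i, a coefficient vector r, and an l2 penalty with weight lambda, and your own derivation
from the class LLH may differ in sign or coding conventions):

    LLH(r)  = sum_i [ y_i (x_i . r) - log(1 + exp(x_i . r)) ] - lambda * ||r||^2
    grad(r) = sum_i x_i [ y_i - sigma(x_i . r) ] - 2 * lambda * r,  where sigma(t) = exp(t) / (1 + exp(t))
    update: r <- r + eta * grad(r)   (eta is the learning rate; this is ascent on the LLH)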
At the end of each iteration, compute the LLH of your model. You should run your gradient descent
until the change in LLH across iterations is very small.
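A minimal NumPy sketch of that loop on synthetic data is below, using the formulas above. It is a local
stand-in only: in the assignment the per-document sums would be computed with Spark (for example by
aggregating over an RDD of TF-IDF vectors), and the learning rate, regularization weight, and stopping
threshold are illustrative guesses.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))                    # stand-in for the TF-IDF matrix
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)    # stand-in 0/1 labels

    lam, eta = 1e-4, 1e-3                              # regularization weight, learning rate
    r = np.zeros(X.shape[1])                           # regression coefficients

    def llh(r):
        theta = X @ r
        # np.logaddexp(0, theta) is a numerically stable log(1 + exp(theta))
        return float(np.sum(y * theta - np.logaddexp(0.0, theta)) - lam * (r @ r))

    old = llh(r)
    for it in range(100000):
        p = 1.0 / (1.0 + np.exp(-(X @ r)))             # sigma(x_i . r) for every document
        grad = X.T @ (y - p) - 2.0 * lam * r           # gradient of the regularized LLH
        r = r + eta * grad                             # ascent step
        new = llh(r)
        if abs(new - old) < 1e-6:                      # stop when the LLH barely changes
            break
        old = new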
Once you have completed this task, you will get credit by (a) writing up your gradient update formula,
and (b) giving us the fifty words with the largest regression coefficients. That is, those fifty words that are
most strongly associated with an Australian court case.
3.3 Task 3
Now that you have trained your model, it is time to evaluate it. Here, you will use your model to predict
whether or not each of the testing points corresponds to an Australian court case. To get credit for this task,
you need to compute for us the F1 score obtained by your classifier—we will use the F1 score obtained as
one of the ways in which we grade your Task 3 submission.
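Precision, recall, and F1 can be computed directly from the prediction counts. A minimal sketch, assuming
pred and truth are 0/1 NumPy arrays over the test documents (1 meaning "Australian court case"); the
function name is illustrative only.

    import numpy as np

    def f1_score(pred, truth):
        tp = np.sum((pred == 1) & (truth == 1))        # true positives
        fp = np.sum((pred == 1) & (truth == 0))        # false positives
        fn = np.sum((pred == 0) & (truth == 1))        # false negatives
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)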
Also, I am going to ask you to actually look at the text for three of the false positives that your model
produced (that is, Wikipedia articles that your model thought were Australian court cases). Write a
paragraph describing why you think your model was fooled. Were the misclassified documents about
Australia? The legal system?
If you don’t have three false positives, just use the ones that you had (if any).
4 Important Considerations
Some notes regarding training and implementation. As you implement and evaluate your gradient descent algorithm, here are a few things to keep in mind.
1. To get good accuracy, you will need to center and normalize your data. That is, transform your data so
that the mean of each dimension is zero, and the standard deviation is one. That is, subtract the mean
vector from each data point, and then divide the result by the vector of standard deviations computed
over the data set.
2. When classifying new data, a data point whose dot product with the set of regression coefs is positive
is a “yes”, a negative is a “no” (see slide 15 in the GLM lecture). You will be trying to maximize the
F1 of your classifier and you can often increase the F1 by choosing a different cutoff between “yes”
and “no” other than zero. Another thing that you can do is to add another dimension whose value is
one in each data point (we discussed this in class). The learning process will then choose a regression
coef for this special dimension that tends to balance the “yes” and “no” nicely at a cutoff of zero.
However, some students in the past have reported that this can increase the training time.
3. Students sometimes face overflow problems, both when computing the LLH and when computing the
gradient update. Some things that you can do to avoid this are: (1) use np.exp(), which seems to
be quite robust, and (2) transform your data so that the standard deviation is smaller than one; if you
have problems with a standard deviation of one, you might try 10^-2 or even 10^-5. You may need to
experiment a bit. Such are the wonderful aspects of implementing data science algorithms in the real
world! (A short sketch of both ideas appears after this list.)
4. If you find that your training takes more than a few hours to run to convergence on the largest data set,
it likely means that you are doing something that is inherently slow that you can speed up by looking
at your code carefully. One thing: there is no problem with first training your model on a small sample
of the large data set (say, 10% of the documents), then using the result as an initialization and continuing
training on the full data set. This can speed up the process of reaching convergence.
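Here is the sketch referred to in item 3: a numerically stable log(1 + exp(theta)) plus a centering and
scaling helper, assuming the data sits in a NumPy matrix with one row per document; the scale value is an
illustrative knob, not a prescribed setting, and the function names are not required by the assignment.

    import numpy as np

    def log1pexp(theta):
        # log(1 + exp(theta)) without overflow for large positive theta
        return np.logaddexp(0.0, theta)

    def center_and_scale(X, scale=1.0):
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0] = 1.0              # avoid dividing by zero for unused dimensions
        return (X - mean) / std * scale  # scale < 1 makes the standard deviation smaller than one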
Big data, small data, and grading. The first two tasks are worth three points, the last four points. Since it
can be challenging to run everything on a large data set, we'll offer you a small data option. If you train your
model on TestingDataOneLinePerDoc.txt, and then test it on SmallTrainingDataOneLinePerDoc.txt,
we'll take off 0.5 points on Task 2 and 0.5 points on Task 3. This means you can still get an A, and
you don't have to deal with the big data set. For the possibility of getting full credit, you can train
your model on the quite large TrainingDataOneLinePerDoc.txt data set, and then test it
on TestingDataOneLinePerDoc.txt.
4.1 Machines to Use
If you decide to try for full credit on the big data set, you will need to run your Spark jobs on three to five
machines as workers, each having around 8 cores. If you are not trying for the full credit, you can likely
get away with running on a smaller cluster. Remember, the costs WILL ADD UP QUICKLY IF YOU
FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and shut down your cluster as soon as
you are done working. You can always create a new one easily when you begin your work again.
4.2 Turnin
Create a single document that has results for all three tasks. Make sure to be very clear whether you
tried the big data or small data option. Turn in this document as well as all of your code. Please zip up all
of your code and your document (use .gz or .zip only, please!), or else attach each piece of code as well as
your document to your submission individually. Do NOT turn in anything other than your Python code and
your results document.
請(qǐng)加QQ:99515681  郵箱:99515681@qq.com   WX:codinghelp













 

標(biāo)簽:

掃一掃在手機(jī)打開當(dāng)前頁
  • 上一篇:AIC2100代寫、Python設(shè)計(jì)程序代做
  • 下一篇:COMP3334代做、代寫Python程序語言
  • 無相關(guān)信息
    昆明生活資訊

    昆明圖文信息
    蝴蝶泉(4A)-大理旅游
    蝴蝶泉(4A)-大理旅游
    油炸竹蟲
    油炸竹蟲
    酸筍煮魚(雞)
    酸筍煮魚(雞)
    竹筒飯
    竹筒飯
    香茅草烤魚
    香茅草烤魚
    檸檬烤魚
    檸檬烤魚
    昆明西山國家級(jí)風(fēng)景名勝區(qū)
    昆明西山國家級(jí)風(fēng)景名勝區(qū)
    昆明旅游索道攻略
    昆明旅游索道攻略
  • 短信驗(yàn)證碼平臺(tái) 理財(cái) WPS下載

    關(guān)于我們 | 打賞支持 | 廣告服務(wù) | 聯(lián)系我們 | 網(wǎng)站地圖 | 免責(zé)聲明 | 幫助中心 | 友情鏈接 |

    Copyright © 2025 kmw.cc Inc. All Rights Reserved. 昆明網(wǎng) 版權(quán)所有
    ICP備06013414號(hào)-3 公安備 42010502001045

    美女扒开腿免费视频_蜜桃传媒一区二区亚洲av_先锋影音av在线_少妇一级淫片免费放播放_日本泡妞xxxx免费视频软件_一色道久久88加勒比一_熟女少妇一区二区三区_老司机免费视频_潘金莲一级黄色片_精品国产精品国产精品_黑人巨大猛交丰满少妇
    国产18无套直看片| 国产日产精品一区二区三区的介绍| 久久久久99精品成人| 国产精品扒开腿做爽爽爽a片唱戏| 91人妻一区二区三区蜜臀| 奇米网一区二区| 五月天色婷婷丁香| 老司机成人免费视频| 精品人体无码一区二区三区| gv天堂gv无码男同在线观看| 亚洲熟女毛茸茸| 又黄又爽又色的视频| 亚洲av综合色区无码另类小说| 国产又粗又猛又爽又黄| 日本美女视频网站| 亚洲第一香蕉网| 911国产在线| 高清中文字幕mv的电影| 捆绑裸体绳奴bdsm亚洲| 国产又黄又粗视频| 手机av在线不卡| 国精产品视频一二二区| 日韩在线观看视频一区二区| 欧美日韩一区二区区| 国产精品jizz| 卡通动漫亚洲综合| 性欧美丰满熟妇xxxx性久久久| 艳母动漫在线看| 九九热免费在线| 搜索黄色一级片| 亚洲精品乱码久久久久久蜜桃欧美| www.555国产精品免费| 国产男女猛烈无遮挡a片漫画| mm131美女视频| 波多野结衣久久久久| 国模大尺度视频| 精品无码人妻一区| 久久久久人妻一区精品色| www.超碰在线观看| 久久久久久久久久久国产精品| 九九热免费在线| 国产白嫩美女无套久久| 欧洲美女女同性互添| 在线观看天堂av| 免费a v网站| 成人免费精品动漫网站| 亚洲第一综合网| 欧美熟妇精品一区二区蜜桃视频| 四虎精品免费视频| 国产一二三四五区| 三叶草欧洲码在线| 国产男女无遮挡猛进猛出| 中文字幕精品视频在线| 免费看特级毛片| 欧美做受高潮6| 野花视频免费在线观看| 男人操女人的视频网站| 精品国产欧美日韩不卡在线观看| 国产一二三av| 国产精品久久久视频| 国产精品理论在线| 极品久久久久久久| 国产wwwwxxxx| 俄罗斯女人裸体性做爰| xxxx国产视频| 艳妇乳肉亭妇荡乳av| av网页在线观看| 久久久久久成人网| 国产真实乱在线更新| 午夜69成人做爰视频| 男人添女人荫蒂国产| 9.1在线观看免费| 久久精品国产亚洲av久| 白白色免费视频| 日韩福利小视频| 91丨porny丨对白| 永久免费成人代码| 成人在线观看免费完整| 怡红院一区二区| 99在线视频免费| 中文字幕人妻一区二区三区| wwwwww日本| 在线观看成人毛片| 伊人网在线视频观看| 国产成人无码aa精品一区| 亚洲自拍偷拍精品| 永久免费看片直接| free性中国hd国语露脸| 91n在线视频| 亚洲专区区免费| 91视频综合网| 一区二区三区久久久久| 一区二区三区影视| 亚洲一区二区三区四区av| 久久久久久久毛片| 波多野结衣三级视频| 国产午夜精品久久久久久久久| 中文字幕在线视频播放| 日本激情视频一区二区三区| 亚洲一区二区三区四区五区六区| 波多野结衣久久久久| 国产伦精品一区二区三区妓女 | 91人人澡人人爽| 五月天婷婷色综合| 全黄一级裸体片| 天堂久久久久久| 中文字幕人妻一区| 日本女人性视频| 国产少妇在线观看| 欧美第一页在线观看| 国产成人一区二区在线观看| 中文字幕一区二区三区乱码不卡| www青青草原| 亚洲av综合色区无码另类小说| 久久国产精品国语对白| 国产在线免费看| 手机av在线不卡| 色婷婷粉嫩av| 无码人妻精品中文字幕| 国产麻豆a毛片| 国产黄色小视频网站| 亚洲欧美卡通动漫| 日韩a级片在线观看| 99热在线观看精品| 中文字幕手机在线观看| 日本黄色一级网站| 2025中文字幕| 51调教丨国产调教视频| 亚洲欧美视频在线播放| 99久久久久久久久久| 一区二区三区久久久久| 四虎地址8848| 国产乱淫av片| 最新中文字幕视频| 成年人免费视频播放| 婷婷伊人五月天| 波多野结衣一二三区| av男人的天堂av| 亚洲成人生活片| 欧美性xxxx图片| 成年人网站免费看| 97中文字幕在线观看| 国内自拍偷拍视频| 私库av在线播放| 国产乱国产乱老熟300部视频| √天堂中文官网8在线| 老司机深夜福利网站| www.99re6| 国产女人被狂躁到高潮小说| 极品人妻一区二区| 丝袜熟女一区二区三区| 国产精品无码在线| 日韩人妻无码一区二区三区| 亚洲国产精品无码久久久久高潮| 国产吞精囗交久久久| 实拍女处破www免费看| 亚洲毛片亚洲毛片亚洲毛片| 天天做夜夜爱爱爱| 乱码一区二区三区| 蜜桃传媒一区二区亚洲av| 熟女俱乐部一区二区视频在线| 成年人在线免费看片| 国产无套精品一区二区三区| 俄罗斯毛片基地| 怡红院一区二区三区| 国产ts丝袜人妖系列视频| 国产在线观看免费播放| 欧美日韩午夜视频| 亚洲精品国产精品乱码在线观看| 亚洲制服丝袜在线播放| 中文字幕乱妇无码av在线| 五月天激情丁香| 二区三区四区视频| 久久精品色妇熟妇丰满人妻| 91网站免费视频| 麻豆精品免费视频| 顶臀精品视频www| 国产在线观看h| 国产肥白大熟妇bbbb视频| 漂亮人妻被黑人久久精品| 丰满人妻一区二区三区大胸| 一区二区三区影视| 三级在线观看免费大全| 亚洲欧美精品久久| 欧洲猛交xxxx乱大交3| 中文字幕一二三| 97香蕉碰碰人妻国产欧美 | www久久久久久久| 久草福利在线观看| 亚洲欧美激情一区二区三区| 免费观看a级片| 亚洲精品卡一卡二| 制服丨自拍丨欧美丨动漫丨| 国产美女免费无遮挡| 狂野欧美性猛交| 国产探花在线视频| 99热在线观看精品| 蜜臀av粉嫩av懂色av| 中文字幕乱码在线人视频| 久草视频手机在线|