
Scraping Google Translate with Python

Request analysis

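First, open the browser's developer tools, then open Google Translate (translate.google.cn).

With the network filter set to JS, the first JS file in the list looks suspicious, so note it down for later.
Next, type some text into the translation box, and the translation request is easy to capture: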

Looking at the GET parameters, a few of them clearly stand out as the important ones:

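A reasonable guess: sl is the source language, tl is the target language, q is the text to translate, and tk is some kind of verification parameter.
So how is the tk parameter obtained?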
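To make the role of these four parameters concrete, here is a minimal sketch that assembles such a GET URL. The function name is hypothetical, the endpoint path and the client and dt values are borrowed from the final code later in this post, and the tk value is only a placeholder.

```python
from urllib.parse import urlencode

# Endpoint as used in the final code of this post.
BASE_URL = "https://translate.google.cn/translate_a/single"

def build_translate_url(text, tk, sl="auto", tl="zh-CN"):
    """Assemble the GET URL from the parameters discussed above (illustrative helper)."""
    params = [
        ("client", "webapp"),
        ("sl", sl),    # source language
        ("tl", tl),    # target language
        ("dt", "t"),   # ask for the translated text
        ("tk", tk),    # verification token
        ("q", text),   # text to translate
    ]
    return BASE_URL + "?" + urlencode(params)

# "123456.654321" is a placeholder, not a valid token.
url = build_translate_url("hello world", "123456.654321", sl="en")
print(url)
```

Only q and tk change from request to request; sl and tl change only when the language pair does.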

Obtaining the tk parameter

Open the JS file noted at the start:
translate.google.cn/translate/releases/twsfe_w_20200622_RC00/r/js/translate_m_zh-CN.js
After formatting it and a long search, the following code turns up:

var pu = function(a) {
        return function() {
            return a
        }
    },
    qu = function(a, b) {
        for (var c = 0; c < b.length - 2; c += 3) {
            var d = b.charAt(c + 2);
            d = "a" <= d ? d.charCodeAt(0) - 87 : Number(d);
            d = "+" == b.charAt(c + 1) ? a >>> d: a << d;
            a = "+" == b.charAt(c) ? a + d & 4294967295 : a ^ d
        }
        return a
    },
    ru = null,
    su = function(a) {
        if (null !== ru) var b = ru;
        else {
            b = pu(String.fromCharCode(84));
            var c = pu(String.fromCharCode(75));
            b = [b(), b()];
            b[1] = c();
            b = (ru = window[b.join(c())] || "") || ""
        }
        var d = pu(String.fromCharCode(116));
        c = pu(String.fromCharCode(107));
        d = [d(), d()];
        d[1] = c();
        c = "&" + d.join("") + "=";
        d = b.split(".");
        b = Number(d[0]) || 0;
        for (var e = [], f = 0, g = 0; g < a.length; g++) {
            var h = a.charCodeAt(g);
            128 > h ? e[f++] = h: (2048 > h ? e[f++] = h >> 6 | 192 : (55296 == (h & 64512) && g + 1 < a.length && 56320 == (a.charCodeAt(g + 1) & 64512) ? (h = 65536 + ((h & 1023) << 10) + (a.charCodeAt(++g) & 1023), e[f++] = h >> 18 | 240, e[f++] = h >> 12 & 63 | 128) : e[f++] = h >> 12 | 224, e[f++] = h >> 6 & 63 | 128), e[f++] = h & 63 | 128)
        }
        a = b;
        for (f = 0; f < e.length; f++) a += e[f],
            a = qu(a, "+-a^+6");
        a = qu(a, "+-3^+b+-f");
        a ^= Number(d[1]) || 0;
        0 > a && (a = (a & 2147483647) + 2147483648);
        a %= 1E6;
        return c + (a.toString() + "." + (a ^ b))
    };
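One detail of the listing is easy to miss: su never spells out the names it uses, but builds them from character codes through pu. The sketch below replays those steps in Python to show which strings come out. The variable names window_key and prefix are mine, and pu here is a Python stand-in for the JS helper.

```python
def pu(value):
    """Python stand-in for the JS helper: returns a function that returns value."""
    return lambda: value

# Key looked up on window: ["T", "K"] joined with the separator "K".
b = pu(chr(84))    # "T"
c = pu(chr(75))    # "K"
parts = [b(), b()]
parts[1] = c()
window_key = c().join(parts)

# Query-string prefix: ["t", "k"] joined with the empty string.
d = pu(chr(116))   # "t"
c = pu(chr(107))   # "k"
parts = [d(), d()]
parts[1] = c()
prefix = "&" + "".join(parts) + "="

print(window_key, prefix)  # prints: TKK &tk=
```

So the function reads window.TKK as its seed and returns a ready-made "&tk=..." fragment for the request URL.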

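Clearly this is an algorithm, and an obfuscated one (the variable names are odd).
Testing confirms that it is the algorithm that computes the tk parameter. With the unneeded parts removed and a little tidying, it becomes: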

// Apply the shift-and-combine steps encoded in b, three characters per step.
var qu = function(a, b) {
        for (var c = 0; c < b.length - 2; c += 3) {
            var d = b.charAt(c + 2);
            d = "a" <= d ? d.charCodeAt(0) - 87 : Number(d);
            d = "+" == b.charAt(c + 1) ? a >>> d: a << d;
            a = "+" == b.charAt(c) ? a + d & 4294967295 : a ^ d
        }
        return a
    },
    // Compute tk for the text a from the page's tkk value.
    su = function(a, tkk) {
        // Declared with var so they do not leak into the global scope.
        var d = tkk.split(".");
        var b = Number(d[0]) || 0;
        // Encode the text as UTF-8 bytes into e.
        for (var e = [], f = 0, g = 0; g < a.length; g++) {
            var h = a.charCodeAt(g);
            128 > h ? e[f++] = h: (2048 > h ? e[f++] = h >> 6 | 192 : (55296 == (h & 64512) && g + 1 < a.length && 56320 == (a.charCodeAt(g + 1) & 64512) ? (h = 65536 + ((h & 1023) << 10) + (a.charCodeAt(++g) & 1023), e[f++] = h >> 18 | 240, e[f++] = h >> 12 & 63 | 128) : e[f++] = h >> 12 | 224, e[f++] = h >> 6 & 63 | 128), e[f++] = h & 63 | 128)
        }
        a = b;
        for (f = 0; f < e.length; f++) a += e[f],
            a = qu(a, "+-a^+6");
        a = qu(a, "+-3^+b+-f");
        a ^= Number(d[1]) || 0;
        0 > a && (a = (a & 2147483647) + 2147483648);
        a %= 1E6;
        return (a.toString() + "." + (a ^ b))
    };
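The strings passed to qu, such as "+-a^+6", are compact programs: each group of three characters names a combine operation, a shift direction, and a shift amount. As a reading aid, the following sketch (describe_ops is a hypothetical helper, not part of Google's code) expands such a string into the steps qu performs.

```python
def describe_ops(ops):
    """Expand a qu op string into one readable step per 3-character group."""
    steps = []
    for i in range(0, len(ops) - 2, 3):
        combine, direction, amount = ops[i], ops[i + 1], ops[i + 2]
        # Letters encode shift amounts from 10 upward ('a' is 10), digits encode themselves.
        n = ord(amount) - 87 if amount >= "a" else int(amount)
        # '+' in the middle position selects an unsigned right shift, anything else a left shift.
        shifted = f"a >>> {n}" if direction == "+" else f"a << {n}"
        if combine == "+":
            steps.append(f"a = (a + ({shifted})) & 0xFFFFFFFF")
        else:
            steps.append(f"a = a ^ ({shifted})")
    return steps

for step in describe_ops("+-a^+6"):
    print(step)
# a = (a + (a << 10)) & 0xFFFFFFFF
# a = a ^ (a >>> 6)
```

Read this way, the loop in su mixes each byte of the text into a running 32-bit value, and the second op string performs a final scramble.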

The tkk value used here can be found in the page source, as shown in the figure below:
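As a rough illustration of pulling that value out programmatically, the sketch below searches a string for the tkk:'...' form. The sample fragment is hand-written for the example rather than a captured page, and extract_tkk is a hypothetical helper name.

```python
import re

# Matches tkk:'<digits>.<digits>' and captures the value.
TKK_PATTERN = re.compile(r"tkk:'(\d+\.\d+)'")

def extract_tkk(page_source):
    """Return the tkk value found in page_source, or None if it is absent."""
    match = TKK_PATTERN.search(page_source)
    return match.group(1) if match else None

# Hand-written stand-in for a fragment of the page source.
sample = "...,tkk:'442788.2585626513',..."
print(extract_tkk(sample))               # prints: 442788.2585626513
print(extract_tkk("no token in here"))   # prints: None
```

Returning None for a missing value lets the caller decide whether to retry or fall back to a fixed tkk.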

Calling the execjs library

Next, test the algorithm from Python with the execjs library. If it is not installed, install it with pip install PyExecJS.

import execjs

# Compile the tidied JS once; ctx.call() then invokes functions defined in it.
ctx = execjs.compile("""
var qu = function(a, b) {
        for (var c = 0; c < b.length - 2; c += 3) {
            var d = b.charAt(c + 2);
            d = "a" <= d ? d.charCodeAt(0) - 87 : Number(d);
            d = "+" == b.charAt(c + 1) ? a >>> d: a << d;
            a = "+" == b.charAt(c) ? a + d & 4294967295 : a ^ d
        }
        return a
    },
    su = function(a, tkk) {
        var d = tkk.split(".");
        var b = Number(d[0]) || 0;
        for (var e = [], f = 0, g = 0; g < a.length; g++) {
            var h = a.charCodeAt(g);
            128 > h ? e[f++] = h: (2048 > h ? e[f++] = h >> 6 | 192 : (55296 == (h & 64512) && g + 1 < a.length && 56320 == (a.charCodeAt(g + 1) & 64512) ? (h = 65536 + ((h & 1023) << 10) + (a.charCodeAt(++g) & 1023), e[f++] = h >> 18 | 240, e[f++] = h >> 12 & 63 | 128) : e[f++] = h >> 12 | 224, e[f++] = h >> 6 & 63 | 128), e[f++] = h & 63 | 128)
        }
        a = b;
        for (f = 0; f < e.length; f++) a += e[f],
            a = qu(a, "+-a^+6");
        a = qu(a, "+-3^+b+-f");
        a ^= Number(d[1]) || 0;
        0 > a && (a = (a & 2147483647) + 2147483648);
        a %= 1E6;
        return (a.toString() + "." + (a ^ b))
    };
""")

print(ctx.call("su", 'test', '442788.2585626513'))  # compute tk for the text 'test'

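Substituting the resulting tk into the captured GET request still returns the JSON as usual, which shows that the tk was computed correctly.
Translation results can therefore be fetched simply by changing the parameters of the GET request.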

Python version of the algorithm

execjs is not very fast, and each call can take a long time. So we take a different approach and port the JS to Python.
The ported code looks roughly like this:

def getGoogleToken(a, TKK):
    """Compute the tk parameter for the text a from the page's TKK value."""
    def RL(a, b):
        # Each 3-character group in b encodes one shift-and-combine step.
        for d in range(0, len(b)-2, 3):
            c = b[d + 2]
            c = ord(c[0]) - 87 if 'a' <= c else int(c)
            c = a >> c if '+' == b[d + 1] else a << c
            a = a + c & 4294967295 if '+' == b[d] else a ^ c
        return a

    # Encode the text as UTF-8 bytes, mirroring the JS loop.
    g = []
    f = 0
    while f < len(a):
        c = ord(a[f])
        if 128 > c:
            g.append(c)
        else:
            if 2048 > c:
                g.append((c >> 6) | 192)
            else:
                is_pair = (55296 == (c & 64512)) and (f + 1 < len(a)) and (56320 == (ord(a[f+1]) & 64512))
                if is_pair or c >= 65536:
                    # JS sees characters above U+FFFF as surrogate pairs, while a Python 3
                    # str holds them as one code point; both cases need the 4-byte form.
                    if is_pair:
                        f += 1
                        c = 65536 + ((c & 1023) << 10) + (ord(a[f]) & 1023)
                    g.append((c >> 18) | 240)
                    g.append((c >> 12) & 63 | 128)
                else:
                    g.append((c >> 12) | 224)
                g.append((c >> 6) & 63 | 128)
            g.append((c & 63) | 128)
        f += 1

    e = TKK.split('.')
    h = int(e[0]) or 0
    t = h
    for item in g:
        t += item
        t = RL(t, '+-a^+6')
    t = RL(t, '+-3^+b+-f')
    t ^= int(e[1]) or 0
    if 0 > t:
        t = (t & 2147483647) + 2147483648
    result = t % 1000000
    return str(result) + '.' + str(result ^ h)
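The byte-building loop is easier to trust once you notice that it is a hand-rolled UTF-8 encoder. The sketch below is my own check rather than part of the original port: it replays the JS branches over UTF-16 code units, which is how JavaScript sees a string, and compares the result with Python's built-in encoder. The name js_style_bytes is hypothetical.

```python
def js_style_bytes(text):
    """Replay the JS encoding loop over UTF-16 code units and return the bytes."""
    raw = text.encode("utf-16-le")
    units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]
    out = []
    i = 0
    while i < len(units):
        h = units[i]
        if h < 128:
            out.append(h)                        # 1-byte form
        else:
            if h < 2048:
                out.append(h >> 6 | 192)         # lead byte of the 2-byte form
            else:
                if (h & 64512) == 55296 and i + 1 < len(units) and (units[i + 1] & 64512) == 56320:
                    # Surrogate pair: rebuild the code point, emit the first two of 4 bytes.
                    i += 1
                    h = 65536 + ((h & 1023) << 10) + (units[i] & 1023)
                    out.append(h >> 18 | 240)
                    out.append(h >> 12 & 63 | 128)
                else:
                    out.append(h >> 12 | 224)    # lead byte of the 3-byte form
                out.append(h >> 6 & 63 | 128)
            out.append(h & 63 | 128)             # final continuation byte
        i += 1
    return bytes(out)

for sample in ["test", "héllo", "中文", "😀"]:
    print(sample, js_style_bytes(sample) == sample.encode("utf-8"))
```

For well-formed text the two agree, so text.encode('utf-8') could stand in for the loop; I have not checked the behaviour for unpaired surrogates.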

Final code

import requests, re, json, time

class GoogleTranslator():
    _host = 'translate.google.cn'
    _headers = {
        'Host': _host,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
        # 'br' is left out: requests decodes Brotli only when a brotli package is installed.
        'Accept-Encoding': 'gzip, deflate',
        'Content-Type': 'application/x-www-form-urlencoded;charset=utf-8',
        'Referer': 'https://' + _host,
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0'
    }
    _url = 'https://' + _host + '/translate_a/single'
    _params = {  # apart from the key parameters, the rest can be copied from the captured request
        'client': 'webapp',
        'sl': 'en',
        'tl': 'zh-CN',
        'hl': 'zh-CN',
        # The captured request repeats dt (at, bd, ex, ld, md, qca, rw, rm, ss, t), but a dict
        # keeps only the last value of a repeated key, so only dt=t is sent.
        'dt': 't',
        'otf': '1',
        'ssel': '0',
        'tsel': '0',
        'kc': '1'
    }

    __cookies = None
    __googleTokenKey = ''
    __googleTokenKeyUpdataTime = 600.0
    __googleTokenKeyRetireTime = time.time() + 600.0

    def __init__(self, src = 'auto', dest = 'zh-CN', tkkUpdataTime = 600.0):
        self._params = dict(self._params)  # per-instance copy, so instances do not share sl/tl
        self._params['sl'] = src
        self._params['tl'] = dest
        self.__googleTokenKeyUpdataTime = tkkUpdataTime
        self.__updateGoogleTokenKey()

    def __updateGoogleTokenKey(self):
        self.__googleTokenKey = self.__getGoogleTokenKey()
        self.__googleTokenKeyRetireTime = time.time() + self.__googleTokenKeyUpdataTime

    def __getGoogleTokenKey(self):
        result = ''
        try:
            res = requests.get('https://' + self._host, timeout = 3)
            res.raise_for_status()
            self.__cookies = res.cookies
            # Read tkk from the page source automatically; a fixed value could be used here instead.
            match = re.search(r'tkk\:\'(\d+\.\d+)\'', res.text)
            if match:
                result = match.group(1)
        except requests.exceptions.RequestException as ex:
            print('ERROR: ' + str(ex))
            time.sleep(1)
        return result

    def __getGoogleToken(self, a, TKK):  # compute tk
        def RL(a, b):
            for d in range(0, len(b)-2, 3):
                c = b[d + 2]
                c = ord(c[0]) - 87 if 'a' <= c else int(c)
                c = a >> c if '+' == b[d + 1] else a << c
                a = a + c & 4294967295 if '+' == b[d] else a ^ c
            return a

        # Encode the text as UTF-8 bytes, mirroring the JS loop.
        g = []
        f = 0
        while f < len(a):
            c = ord(a[f])
            if 128 > c:
                g.append(c)
            else:
                if 2048 > c:
                    g.append((c >> 6) | 192)
                else:
                    is_pair = (55296 == (c & 64512)) and (f + 1 < len(a)) and (56320 == (ord(a[f+1]) & 64512))
                    if is_pair or c >= 65536:
                        # Characters above U+FFFF are one code point in Python 3 but a
                        # surrogate pair in JS; both cases need the 4-byte form.
                        if is_pair:
                            f += 1
                            c = 65536 + ((c & 1023) << 10) + (ord(a[f]) & 1023)
                        g.append((c >> 18) | 240)
                        g.append((c >> 12) & 63 | 128)
                    else:
                        g.append((c >> 12) | 224)
                    g.append((c >> 6) & 63 | 128)
                g.append((c & 63) | 128)
            f += 1

        e = TKK.split('.')
        h = int(e[0]) or 0
        t = h
        for item in g:
            t += item
            t = RL(t, '+-a^+6')
        t = RL(t, '+-3^+b+-f')
        t ^= int(e[1]) or 0
        if 0 > t:
            t = (t & 2147483647) + 2147483648
        result = t % 1000000
        return str(result) + '.' + str(result ^ h)


    def translate(self, text):  # extract the translation from the returned JSON
        if time.time() > self.__googleTokenKeyRetireTime:
            self.__updateGoogleTokenKey()
        data = {'q': text}
        result = ''
        try:
            # Computed inside the try block so that a missing tkk is reported, not raised.
            self._params['tk'] = self.__getGoogleToken(text, self.__googleTokenKey)
            res = requests.post(self._url,
                                headers = self._headers,
                                cookies = self.__cookies,
                                data = data,
                                params = self._params,
                                timeout = 6)
            res.raise_for_status()
            jsonText = res.text
            #print(jsonText)
            if len(jsonText)>0:
                jsonResult = json.loads(jsonText)
                if len(jsonResult[0])>0:
                    for item in jsonResult[0]:
                        if item[0]:  # skip segments that carry no translated text
                            result += item[0]
            return result
        except Exception as ex:
            print('ERROR: ' + str(ex))
            return ''

def Translate(text,to):
    translator = GoogleTranslator(dest=to)
    result = translator.translate(text)
    return result

if __name__ == '__main__':
    print(Translate(input(': '),'zh-CN'))  # auto-detect the source language and translate into Chinese
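The class refetches its token key once the stored one is older than the update interval (600 seconds by default). To show that refresh-on-expiry pattern in isolation, here is a minimal sketch. ExpiringValue and its parameters are hypothetical names, and the clock is injectable so the behaviour can be tried without waiting.

```python
import time

class ExpiringValue:
    """Cache one value and refetch it once it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.time):
        self._fetch = fetch          # zero-argument callable producing a fresh value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable time source
        self._value = None
        self._expires_at = None      # None means nothing has been fetched yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now > self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value

# Demo with a fake clock and a counter in place of a network fetch.
fake_now = [0.0]
fetch_count = [0]

def fake_fetch():
    fetch_count[0] += 1
    return "key-%d" % fetch_count[0]

cache = ExpiringValue(fake_fetch, 600.0, clock=lambda: fake_now[0])
print(cache.get())     # prints: key-1
fake_now[0] = 599.0
print(cache.get())     # prints: key-1 (still fresh)
fake_now[0] = 601.0
print(cache.get())     # prints: key-2 (expired, refetched)
```

Keeping the expiry logic separate from the HTTP call like this would make the refresh behaviour testable on its own.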

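The code above only extracts the translated text. The returned JSON actually contains a good deal of other information, so the code can be extended as needed.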
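As a starting point for such extensions, the sketch below isolates the extraction step: it assumes, as the final code does, that the first element of the decoded JSON is a list of segments whose first item is the translated text. The sample string is hand-written to have that shape and is not a captured response; join_translation is a hypothetical name.

```python
import json

def join_translation(json_text):
    """Concatenate the translated text of every segment in the first element."""
    payload = json.loads(json_text)
    if not (isinstance(payload, list) and payload and isinstance(payload[0], list)):
        return ""
    return "".join(
        seg[0]
        for seg in payload[0]
        if isinstance(seg, list) and seg and isinstance(seg[0], str)
    )

# Hand-written stand-in with the assumed shape; the last segment has no translated text.
sample = '[[["Bonjour ", "Hello ", null], ["le monde", "world", null], [null, null, "x"]], null, "en"]'
print(join_translation(sample))  # prints: Bonjour le monde
```

Anything beyond the first element would need to be inspected in a real response before relying on its position.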
Finally, here are the language codes (the values used for sl and tl):

source_code_name: [
    { code: 'auto', name: 'Detect language' },
    { code: 'sq', name: 'Albanian' },
    { code: 'ar', name: 'Arabic' },
    { code: 'am', name: 'Amharic' },
    { code: 'az', name: 'Azerbaijani' },
    { code: 'ga', name: 'Irish' },
    { code: 'et', name: 'Estonian' },
    { code: 'or', name: 'Odia (Oriya)' },
    { code: 'eu', name: 'Basque' },
    { code: 'be', name: 'Belarusian' },
    { code: 'bg', name: 'Bulgarian' },
    { code: 'is', name: 'Icelandic' },
    { code: 'pl', name: 'Polish' },
    { code: 'bs', name: 'Bosnian' },
    { code: 'fa', name: 'Persian' },
    { code: 'af', name: 'Afrikaans' },
    { code: 'tt', name: 'Tatar' },
    { code: 'da', name: 'Danish' },
    { code: 'de', name: 'German' },
    { code: 'ru', name: 'Russian' },
    { code: 'fr', name: 'French' },
    { code: 'tl', name: 'Filipino' },
    { code: 'fi', name: 'Finnish' },
    { code: 'fy', name: 'Frisian' },
    { code: 'km', name: 'Khmer' },
    { code: 'ka', name: 'Georgian' },
    { code: 'gu', name: 'Gujarati' },
    { code: 'kk', name: 'Kazakh' },
    { code: 'ht', name: 'Haitian Creole' },
    { code: 'ko', name: 'Korean' },
    { code: 'ha', name: 'Hausa' },
    { code: 'nl', name: 'Dutch' },
    { code: 'ky', name: 'Kyrgyz' },
    { code: 'gl', name: 'Galician' },
    { code: 'ca', name: 'Catalan' },
    { code: 'cs', name: 'Czech' },
    { code: 'kn', name: 'Kannada' },
    { code: 'co', name: 'Corsican' },
    { code: 'hr', name: 'Croatian' },
    { code: 'ku', name: 'Kurdish' },
    { code: 'la', name: 'Latin' },
    { code: 'lv', name: 'Latvian' },
    { code: 'lo', name: 'Lao' },
    { code: 'lt', name: 'Lithuanian' },
    { code: 'lb', name: 'Luxembourgish' },
    { code: 'rw', name: 'Kinyarwanda' },
    { code: 'ro', name: 'Romanian' },
    { code: 'mg', name: 'Malagasy' },
    { code: 'mt', name: 'Maltese' },
    { code: 'mr', name: 'Marathi' },
    { code: 'ml', name: 'Malayalam' },
    { code: 'ms', name: 'Malay' },
    { code: 'mk', name: 'Macedonian' },
    { code: 'mi', name: 'Maori' },
    { code: 'mn', name: 'Mongolian' },
    { code: 'bn', name: 'Bengali' },
    { code: 'my', name: 'Burmese' },
    { code: 'hmn', name: 'Hmong' },
    { code: 'xh', name: 'Xhosa' },
    { code: 'zu', name: 'Zulu' },
    { code: 'ne', name: 'Nepali' },
    { code: 'no', name: 'Norwegian' },
    { code: 'pa', name: 'Punjabi' },
    { code: 'pt', name: 'Portuguese' },
    { code: 'ps', name: 'Pashto' },
    { code: 'ny', name: 'Chichewa' },
    { code: 'ja', name: 'Japanese' },
    { code: 'sv', name: 'Swedish' },
    { code: 'sm', name: 'Samoan' },
    { code: 'sr', name: 'Serbian' },
    { code: 'st', name: 'Sesotho' },
    { code: 'si', name: 'Sinhala' },
    { code: 'eo', name: 'Esperanto' },
    { code: 'sk', name: 'Slovak' },
    { code: 'sl', name: 'Slovenian' },
    { code: 'sw', name: 'Swahili' },
    { code: 'gd', name: 'Scottish Gaelic' },
    { code: 'ceb', name: 'Cebuano' },
    { code: 'so', name: 'Somali' },
    { code: 'tg', name: 'Tajik' },
    { code: 'te', name: 'Telugu' },
    { code: 'ta', name: 'Tamil' },
    { code: 'th', name: 'Thai' },
    { code: 'tr', name: 'Turkish' },
    { code: 'tk', name: 'Turkmen' },
    { code: 'cy', name: 'Welsh' },
    { code: 'ug', name: 'Uyghur' },
    { code: 'ur', name: 'Urdu' },
    { code: 'uk', name: 'Ukrainian' },
    { code: 'uz', name: 'Uzbek' },
    { code: 'es', name: 'Spanish' },
    { code: 'iw', name: 'Hebrew' },
    { code: 'el', name: 'Greek' },
    { code: 'haw', name: 'Hawaiian' },
    { code: 'sd', name: 'Sindhi' },
    { code: 'hu', name: 'Hungarian' },
    { code: 'sn', name: 'Shona' },
    { code: 'hy', name: 'Armenian' },
    { code: 'ig', name: 'Igbo' },
    { code: 'it', name: 'Italian' },
    { code: 'yi', name: 'Yiddish' },
    { code: 'hi', name: 'Hindi' },
    { code: 'su', name: 'Sundanese' },
    { code: 'id', name: 'Indonesian' },
    { code: 'jw', name: 'Javanese' },
    { code: 'en', name: 'English' },
    { code: 'yo', name: 'Yoruba' },
    { code: 'vi', name: 'Vietnamese' },
    { code: 'zh-CN', name: 'Chinese' }
]
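One possible use of this list is to check a language pair before sending a request. The sketch below uses only a handful of the codes, with English names, and the function name is hypothetical. It also assumes that 'auto' makes sense only as a source language, which the list's name suggests but which I have not verified against the service.

```python
# A few entries taken from the list above; a real lookup would include all of them.
LANGUAGE_NAMES = {
    "auto": "Detect language",
    "en": "English",
    "fr": "French",
    "ja": "Japanese",
    "zh-CN": "Chinese",
}

def check_language_pair(sl, tl):
    """Return the (source, target) language names, or raise ValueError for unknown codes."""
    if sl not in LANGUAGE_NAMES:
        raise ValueError("unknown source language code: %r" % sl)
    # Assumption: automatic detection is meaningful only for the source side.
    if tl not in LANGUAGE_NAMES or tl == "auto":
        raise ValueError("unknown target language code: %r" % tl)
    return LANGUAGE_NAMES[sl], LANGUAGE_NAMES[tl]

print(check_language_pair("auto", "zh-CN"))  # prints: ('Detect language', 'Chinese')
```

Failing early on a mistyped code avoids spending a request on a pair the service is unlikely to accept.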