Updated: June 21, 2021, 14:56  Source: 乐鱼电竞

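lxml is a Python library, built on the C libraries libxml2 and libxslt, that is mainly used to parse HTML or XML data and extract information from it. It is feature-rich and easy to use, and it supports XPath syntax for quickly locating specific elements or nodes.
Most of lxml's functionality lives in the lxml.etree module, which is commonly imported as follows: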
from lxml import etree
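The main concepts in lxml are as follows:
(1) Element class: can be thought of as a single XML node.
(2) ElementTree class: can be thought of as a complete XML document tree.
(3) ElementPath: can be thought of as a simplified XPath, used to search for and locate nodes.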
1. Introduction to the Element class
The Element class is the core class for XML processing. It can be understood intuitively as an XML node, and most XML node handling revolves around it. A node object can be created directly by calling Element(). For example:
root = etree.Element('root')
In the example above, the argument 'root' is the name of the node.
Operations on the Element class fall into three groups: node operations, node attribute operations, and node text operations. Each is introduced below.
(1) Node operations: the name of a node is available through its tag attribute. For example:
print(root.tag)
# Output:
root
(2) Node attribute operations: attributes can be added to a node when it is created. A node's attributes are stored as key-value pairs, much like a dictionary. When a node is created with Element(), attributes can be passed as keyword arguments, where the argument name is the attribute name and the argument value is the attribute value. For example:
# Create the root node and give it an attribute
root = etree.Element('root', interesting='totally')
print(etree.tostring(root))
# Output:
b'<root interesting="totally"/>'
In addition, the set() method adds an attribute to an existing node. It takes two arguments: the first is the attribute name and the second is the attribute value. For example:
# Add an age attribute to the existing root node
root.set('age', '30')
print(etree.tostring(root))
# Output:
b'<root interesting="totally" age="30"/>'
Both examples above use the tostring() function, which serializes an element to an encoded string representation of its XML tree. By default the result is a bytes object, which is why the output carries a b prefix.
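Because attributes behave like dictionary entries, they can also be read back from a node. The following is a minimal sketch using the get() method and the attrib property of an lxml element. The variable names and the 'height' attribute are illustrative only, and encoding='unicode' is passed to tostring() to obtain a str instead of bytes.

```python
from lxml import etree

root = etree.Element('root', interesting='totally')
root.set('age', '30')

# get() reads one attribute; the second argument is returned when it is missing
interesting = root.get('interesting')
missing = root.get('height', 'unknown')  # 'height' is a made-up attribute name

# attrib is a dictionary-like view of all attributes on the node
attributes = dict(root.attrib)

# encoding='unicode' makes tostring() return str rather than bytes
as_text = etree.tostring(root, encoding='unicode')

print(interesting, missing)
print(attributes)
print(as_text)
```

Reading with get() and a default avoids a KeyError-style failure when an attribute may be absent.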
(3) Node text operations: text content is usually accessed through the text and tail attributes or through the xpath() method. The following example accesses a node's text through the text attribute:
root = etree.Element('root')  # create the root node
root.text = 'Hello, World!'   # add text to the root node
print(root.text)
print(etree.tostring(root))
# Output:
Hello, World!
b'<root>Hello, World!</root>'
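The tail attribute and the xpath() method mentioned above can be illustrated with a small sketch. It assumes a made-up document in which a <br/> child splits the text of the root node; the variable names are illustrative.

```python
from lxml import etree

# The <br/> child splits the text: "Hello" is root.text, "World" is br.tail
root = etree.XML('<root>Hello<br/>World</root>')
br = root[0]

head_text = root.text                # text before the first child
tail_text = br.tail                  # text directly after the <br/> element
all_text = root.xpath('string()')    # all text in the subtree, concatenated
text_nodes = root.xpath('//text()')  # every text node, returned as a list

print(head_text, tail_text)
print(all_text)
print(text_nodes)
```

In this model, text that follows a child element is stored on that child's tail rather than on the parent, which is why text alone does not return everything inside a node with children.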
2. Parsing XML from a string or file
To parse XML into a tree structure, the etree module provides the following three functions:
(1) fromstring() function: parses an XML document or fragment from a string and returns the root node (or the result returned by a parser target).
(2) XML() function: parses an XML document or fragment from a string constant and returns the root node (or the result returned by a parser target).
(3) HTML() function: parses an HTML document or fragment from a string constant and returns the root node (or the result returned by a parser target).
The XML() function behaves like fromstring() and is typically used to write XML literals directly into source code. The HTML() function automatically adds missing <html> and <body> tags. The three functions are used as follows:
xml_data = '<root>data</root>'
# fromstring()
root_one = etree.fromstring(xml_data)
print(root_one.tag)
print(etree.tostring(root_one))
# XML(), which behaves essentially the same as fromstring()
root_two = etree.XML(xml_data)
print(root_two.tag)
print(etree.tostring(root_two))
# HTML(), which adds <html> and <body> tags if they are missing
root_three = etree.HTML(xml_data)
print(root_three.tag)
print(etree.tostring(root_three))
The program prints:
root
b'<root>data</root>'
root
b'<root>data</root>'
html
b'<html><body><root>data</root></body></html>'
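The difference between the XML and HTML functions shows up most clearly with markup that is not well formed. The sketch below uses a made-up fragment with unclosed <li> tags. It assumes the default parsers, under which fromstring() raises etree.XMLSyntaxError while HTML() repairs the input.

```python
from lxml import etree

fragment = '<ul><li>one<li>two</ul>'  # the <li> tags are never closed

# The XML parser is strict and rejects markup that is not well formed
try:
    etree.fromstring(fragment)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser repairs the fragment and wraps it in <html><body>
html_root = etree.HTML(fragment)
items = html_root.xpath('//li/text()')

print(xml_ok)
print(html_root.tag)
print(items)
```

For pages fetched from the web, which are often not valid XML, HTML() is usually the safer entry point.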
Besides these three functions, the parse() function parses XML directly from a file. If no parser is passed, the default parser is used. The function returns an ElementTree object. For example:
html=etree.parse('./hello.html')
result=etree.tostring(html, pretty_print=True)
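A self-contained way to try parse() is to write a small file first. The sketch below assumes a temporary directory and a shortened, made-up version of an HTML file. It parses the file once with the default XML parser and once with an explicit etree.HTMLParser(), then reads the root node through getroot().

```python
import os
import tempfile
from lxml import etree

content = '<div><ul><li class="item-0">first item</li></ul></div>'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'hello.html')  # hypothetical file location
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)

    # Default (XML) parser: the file must be well formed
    tree = etree.parse(path)
    # Explicit HTML parser: tolerant, wraps the content in <html><body>
    html_tree = etree.parse(path, etree.HTMLParser())

# parse() returns a tree object; getroot() gives its root element
root_tag = tree.getroot().tag
html_root_tag = html_tree.getroot().tag

# tostring() returns bytes; decode() turns it into printable text
print(etree.tostring(tree, pretty_print=True).decode())
print(root_tag, html_root_tag)
```

Passing a parser explicitly is one way to make the choice between strict XML and tolerant HTML parsing visible in the code.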
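3. Introduction to ElementPath
The ElementTree class comes with ElementPath, a simple path language similar to XPath. The ElementTree and Element classes provide three commonly used methods that cover most search and query needs. Each takes a path expression written in ElementPath syntax, which is a subset of XPath:
(1) find() method: returns the first matching subelement.
(2) findall() method: returns all matching subelements as a list.
(3) iterfind() method: returns an iterator over all matching elements.
The search starts from the node the method is called on, here the root node of the document tree. For example:
# Parse XML from a string; the root node is returned
root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
# Search from the root node and print the name of the matched node
print(root.find("a").tag)
# Search from the root node and print the name of the first node in the match list
print(root.findall(".//a[@x]")[0].tag)
The program prints:
a
a
The xpath() method can also be called on an element; it evaluates a full XPath expression with that element as the context node.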
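The remaining methods can be sketched on the same sample document. The example below shows iterfind(), the None result of find() when nothing matches, and two things xpath() supports that the ElementPath methods do not: XPath functions and attribute results. The variable names are illustrative.

```python
from lxml import etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# find() returns None when nothing matches, so check before using the result
missing = root.find('d')

# iterfind() yields matches lazily; here the two <b> children of <a>
b_tags = [el.tag for el in root.iterfind('a/b')]

# findall() with a descendant search returns a list
c_count = len(root.findall('.//c'))

# xpath() evaluates full XPath relative to the element it is called on
a = root.find('a')
b_total = a.xpath('count(./b)')   # XPath numbers come back as float
x_values = root.xpath('a/@x')     # attribute values come back as strings

print(missing, b_tags, c_count, b_total, x_values)
```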
4. Basic usage of lxml
An example HTML file is used here to demonstrate basic usage of the lxml library. The file is named hello.html and has the following content:
<!-- hello.html -->
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
Next, based on the HTML document above, path expressions are passed to the xpath() method to select matching nodes, as follows:
Selecting li nodes at any position
The "//" operator selects li nodes anywhere in the document. The path expression is:
//li
The xpath() method provided by lxml.etree evaluates this path expression against hello.html and returns the matches as a list, which is then printed. The code is as follows:
from lxml import etree
html = etree.parse('hello.html')
# Find all li nodes
result = html.xpath('//li')
# Print the list of <li> elements
print(result)
# Print the number of <li> elements
print(len(result))
# Print the type of the returned result
print(type(result))
# Print the type of the first element
print(type(result[0]))
The program prints (the memory addresses vary from run to run):
[<Element li at 0x2cc9a48>, <Element li at 0x2cc99c8>, <Element li at 0x2cc9a88>, <Element li at 0x2cc9ac8>, <Element li at 0x2cc9b08>]
5
<class 'list'>
<class 'lxml.etree._Element'>
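Since each item in the returned list is an element, it can be queried again. The sketch below parses a shortened copy of the sample page from a string (an assumption made so the example runs without a file) and collects the class, link target, and link text of each <li>.

```python
from lxml import etree

# A shortened copy of the sample page, parsed from a string for self-containment
page = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>
</div>
'''
root = etree.fromstring(page.strip())

rows = []
for li in root.xpath('//li'):
    link = li.find('a')  # each element in the result list can be searched again
    rows.append((li.get('class'), link.get('href'), link.text))

for row in rows:
    print(row)
```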
Getting the class attribute of the <li> tags
Append "/" to the previous expression to step down one level, and use @ to select the class attribute node. The expression is:
//li/@class
The following code gets the class attribute of each <li> tag:
from lxml import etree
html = etree.parse('hello.html')
# Find the class attribute of every li tag
result = html.xpath('//li/@class')
print(result)
The program prints:
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
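Attributes can also be used to filter nodes instead of being the result. The sketch below is one possible illustration: it embeds the sample page as a string (an assumption, to avoid depending on a file) and uses an attribute predicate and the standard XPath 1.0 function starts-with().

```python
from lxml import etree

page = '''<div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
root = etree.fromstring(page)

# [@class="item-1"] keeps only the <li> tags whose class equals item-1
hrefs = root.xpath('//li[@class="item-1"]/a/@href')

# starts-with() matches on a prefix of the attribute value
inactive = root.xpath('//li[starts-with(@class, "item-i")]/@class')

print(hrefs)
print(inactive)
```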
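Getting the content of the second-to-last element
Select the second-to-last <li> tag from anywhere in the document, then step down to its <a> tag. To get the text of that tag, either of the following expressions can be used:
//li[last()-1]/a
or
//li[last()-1]/a/text()
The difference is that the first expression returns the element, so its text attribute must be accessed to get the text, while the second expression returns the text directly. The following example uses the first path expression:
from lxml import etree
html = etree.parse('hello.html')
# Get the content of the second-to-last element
result = html.xpath('//li[last()-1]/a')
print(result[0].text)
The program prints: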
fourth item
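The second expression can be sketched in the same way. The example below embeds the sample page as a string (an assumption, so that it runs without hello.html) and also shows a case worth knowing: for the third item the text sits inside a nested <span>, so the text attribute of <a> is None and a descendant text() step is needed.

```python
from lxml import etree

page = '''<div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
root = etree.fromstring(page)

# text() returns the text directly, as a list of strings
fourth = root.xpath('//li[last()-1]/a/text()')

# The third <a> has no direct text; its text lives in the nested <span>
third_link = root.xpath('//li[3]/a')[0]
third_direct = third_link.text
third_nested = root.xpath('//li[3]/a//text()')

print(fourth)
print(third_direct)
print(third_nested)
```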